Considerations for the Alternate Assessment based on
Modified Achievement Standards (AA-MAS): Understanding
the Eligible Population and Applying that Knowledge to their
Instruction and Assessment
A White Paper Commissioned by the New York Comprehensive
Center in Collaboration with the New York State Education
Department
Project Manager & Editor: Marianne Perie, Center for Assessment
Authors:
Jamal Abedi, University of California, Davis
Christopher Domaleski, Center for Assessment
Stephen Dunbar, University of Iowa
Meagan Karvonen, Western Carolina University
Scott Marion, Center for Assessment
Jim Pellegrino, University of Illinois, Chicago
Marianne Perie, Center for Assessment
David Pugalee, University of North Carolina, Charlotte
Rachel Quenemoen, National Center on Educational Outcomes
Robert Rickelman, University of North Carolina, Charlotte
Catherine Welch, University of Iowa
Reviewers:
Howard Everson, Fordham University
Claudia Flowers, University of North Carolina, Charlotte
Brian Gong, Center for Assessment
Suzanne Lane, University of Pittsburgh
Katherine Ryan, University of Illinois, Urbana-Champaign
Gerald Tindal, University of Oregon
Phoebe Winter, Pacific Metrics
FINAL VERSION: July 16, 2009
The contents of this publication were developed under cooperative agreement S283B050019 with the U. S. Department of Education. However, the
contents do not necessarily represent the policy of the Department of Education, and you should not assume endorsement by the Federal Government.
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION............................................................................................................................... 1
BACKGROUND ON FEDERAL REGULATIONS GUIDING THE AA-MAS .........................................................................3
SETTING THE STAGE FOR THIS REPORT ...................................................................................................................... 6
Issues Specific to New York State ......................................................................................................................... 7
Organization of the Paper ....................................................................................................................................9
REFERENCES ............................................................................................................................................................ 14
SECTION I: IDENTIFYING AND UNDERSTANDING THE POPULATION ................................................. 15
CHAPTER 2: IDENTIFYING STUDENTS AND CONSIDERING WHY & WHETHER TO ASSESS THEM
WITH AN ALTERNATE ASSESSMENT BASED ON MODIFIED ACHIEVEMENT STANDARDS........... 17
HISTORY OF ASSESSMENT OPTIONS RELATED TO AA-MAS .................................................................................... 18
WHO ARE THE STUDENTS? THE COMPLEXITY OF REGULATORY REQUIREMENTS. ................................................... 22
Disability Categories Overview: Limitations of Categorical Designations to Predict Attainment of Standards23
IEPs and Access to the General Curriculum ...................................................................................................... 26
OPERATIONALIZING THE REGULATORY LANGUAGE: IDENTIFYING WHO MAY BE ELIGIBLE ................................... 29
Student Characteristics and Opportunity to Learn Investigations ...................................................................... 31
STRATEGIES TO IMPROVE OPPORTUNITIES TO LEARN AND IDENTIFY STUDENTS WHO MAY BENEFIT FROM AAMAS ........................................................................................................................................................................ 33
REGULATORY OPTIONS, POLICY PREROGATIVES, AND IMPLICATIONS ..................................................................... 38
Varying State Assumptions, Philosophies and Beliefs, and Their Implications .................................................. 40
Relationship of AA-MAS to the State Options on Accommodations and Alternate Assessments Based on Gradelevel Achievement Standards .............................................................................................................................. 44
QUESTIONS FOR STATES AS THEY CHOOSE, BUILD, AND DEFEND THE VALIDITY OF THEIR APPROACH .................. 46
REFERENCES ............................................................................................................................................................ 48
CHAPTER 3: DEVELOPING STANDARDS-BASED IEPS THAT PROMOTE EFFECTIVE
INSTRUCTION ......................................................................................................................................................... 51
AN OVERVIEW OF THE IEP ....................................................................................................................................... 53
What Does it Mean for an IEP to be “Standards-Based”? ................................................................................ 56
EFFECTIVE CURRICULUM AND INSTRUCTION ........................................................................................................... 58
1. Students are given access to grade-level content ............................................................................................ 59
2. Instruction consists of proven practices that allow teachers to set a trajectory toward performance that can
be evaluated against grade-level achievement standards ................................................................................... 62
3. Instruction is flexible and responsive to student progress .............................................................................. 64
4. The instructional program minimizes barriers and provides the full range of supports necessary to promote
growth ................................................................................................................................................................. 66
USING THE IEP TO PROMOTE QUALITY INSTRUCTION.............................................................................................. 68
Reviewing Present Levels of Performance and Identifying Need ....................................................................... 68
Selecting and Writing Academic Goals............................................................................................................... 69
Monitoring Goal Attainment ............................................................................................................................... 70
Choosing and Designing Supports ...................................................................................................................... 71
STATE GUIDANCE TO IEP TEAMS............................................................................................................................. 73
Decisions about Participation in AA-MAS ......................................................................................................... 73
Decisions about Accommodations ...................................................................................................................... 75
Other Guidance................................................................................................................................................... 77
SYSTEMS FOR MONITORING IEPS............................................................................................................................. 79
VALIDITY EVIDENCE ................................................................................................................................................ 80
CONCLUSION ............................................................................................................................................................ 82
Challenges and Caveats...................................................................................................................................... 83
REFERENCES ............................................................................................................................................................ 86
CHAPTER 4: THE CHALLENGES OF CONCEPTUALIZING WHAT LOW ACHIEVERS KNOW AND
HOW TO ASSESS THEIR COMPETENCE.......................................................................................................... 90
TWO CRITICAL ISSUES FOR CONCEPTUALIZING STUDENT ASSESSMENT .................................................................. 91
The Curriculum-Instruction-Assessment Triad ................................................................................................... 91
Assessment as a Process of Reasoning from Evidence ....................................................................................... 93
FUNDAMENTAL COMPONENTS OF COGNITION AND SOME IMPLICATIONS FOR ASSESSMENT.................................... 95
Working Memory ................................................................................................................................................ 96
Metacognition ..................................................................................................................................................... 96
Types of Knowledge and Processes of Acquisition ............................................................................................. 97
The Role of Practice and Feedback .................................................................................................................. 100
The Role of Social Context, Cultural Norms, and Student Beliefs .................................................................... 101
Some Implications for Low Achieving/performing Students ............................................................................. 102
DOMAIN SPECIFIC ASPECTS OF COGNITION AND LEARNING .................................................................................. 104
K-8 Reading ...................................................................................................................................................... 106
K-8 Mathematics ............................................................................................................................................... 110
Is All This Detail Necessary? ............................................................................................................................ 123
IMPLICATIONS FOR ASSESSMENT DESIGN .............................................................................................................. 125
Assessment Purposes, Levels & Timescales ...................................................................................................... 127
Implications of Cognitive Theory & Research for Classroom Assessment ....................................................... 129
Implications of Cognitive Research and Theory for Large-scale Assessment .................................................. 131
The Design of Observational Situations ........................................................................................................... 133
Validation ......................................................................................................................................................... 136
Reporting .......................................................................................................................................................... 137
Fairness ............................................................................................................................................................ 137
CONCLUSIONS & CAVEATS .................................................................................................................................... 138
REFERENCES .......................................................................................................................................................... 142
SECTION II: TEST DESIGN: UNDERSTANDING CONTENT AND ACHIEVEMENT STANDARDS AND
INCORPORATING APPROPRIATE ITEM MODIFICATIONS ..................................................................... 152
CHAPTER 5: UNDERSTANDING THE CONTENT ......................................................................................... 154
WHAT IS THE CURRICULUM?.................................................................................................................................. 155
HOW ARE CONTENT STANDARDS DEVELOPED? ..................................................................................................... 156
DIFFERENCE BETWEEN CONTENT STANDARDS AND ACHIEVEMENT STANDARDS .................................................. 161
BARRIERS IN PROVIDING ACCESS TO THE GENERAL CURRICULUM ........................................................................ 162
THE CONTENT STANDARDS FOR MATHEMATICS .................................................................................................... 166
Sampling Mathematics ...................................................................................................................................... 169
THE CONTENT STANDARDS FOR ENGLISH/LANGUAGE ARTS ................................................................................. 170
ALIGNING STATE STANDARDS TO ASSESSMENTS ................................................................................................... 173
GUIDANCE ON TEST SPECIFICATIONS ..................................................................................................................... 173
JUDGING THE ALIGNMENT BETWEEN EXPECTATIONS AND MODIFIED ASSESSMENTS ............................................ 176
CONSIDERING COGNITIVE COMPLEXITY IN ASSESSMENTS ..................................................................................... 179
Levels of Cognitive Complexity in Mathematics ............................................................................................... 183
ITEM MODIFICATIONS ............................................................................................................................................ 185
HOW DO WE LINK CONTENT TO CURRICULUM AND INSTRUCTION APPROPRIATE FOR THIS POPULATION? ........... 190
REFERENCES .......................................................................................................................................................... 192
CHAPTER 6: DEVELOPING ITEMS AND ASSEMBLING TEST FORMS FOR THE ALTERNATE
ASSESSMENT BASED ON MODIFIED ACHIEVEMENT STANDARDS (AA-MAS).................................. 195
BEST PRACTICE IN ITEM DEVELOPMENT AND FORMS ASSEMBLY .......................................................................... 197
Test Domain ...................................................................................................................................................... 198
Test Specifications ............................................................................................................................................ 199
Item Development ............................................................................................................................................. 202
TEST SPECIFICATIONS, ITEM DEVELOPMENT, FORMS ASSEMBLY, AND ITEM-LEVEL STATISTICS FOR THE AA-MAS
............................................................................................................................................................................... 205
Special Considerations for Test Specifications ................................................................................................. 208
Special Considerations for Item Development .................................................................................................. 212
Special Considerations for Forms Assembly .................................................................................................... 221
Special Considerations for Evaluating Statistical Characteristics of AA-MAS Items ...................................... 223
Validating the AA-MAS..................................................................................................................................... 230
REFERENCES .......................................................................................................................................................... 232
CHAPTER 7: DEVELOPING MODIFIED ACHIEVEMENT LEVEL DESCRIPTORS AND SETTING
CUT SCORES .......................................................................................................................................................... 235
DEFINING ACHIEVEMENT STANDARDS................................................................................................................... 236
Numbers ............................................................................................................................................................ 236
Names ............................................................................................................................................................... 237
Descriptors ....................................................................................................................................................... 238
Cut Scores ......................................................................................................................................................... 238
DEFINING PROFICIENCY ......................................................................................................................................... 239
Applying Theories of Learning to Modified Achievement Level Descriptors ................................................... 243
Procedures for Drafting Modified Achievement Level Descriptors .................................................................. 245
SETTING CUT SCORES ............................................................................................................................................ 250
Test-Based Approach ........................................................................................................................................ 251
Other Standard-Setting Approaches ................................................................................................................. 256
Linking Tests through Cut Scores ..................................................................................................................... 260
FINAL CONSIDERATIONS ........................................................................................................................................ 260
Documentation .................................................................................................................................................. 261
Validation ......................................................................................................................................................... 262
REFERENCES .......................................................................................................................................................... 265
SECTION III: TECHNICAL CONSIDERATIONS AND PRACTICAL APPLICATIONS ........................... 267
CHAPTER 8: COMPARABILITY ISSUES IN THE ALTERNATE ASSESSMENT BASED ON MODIFIED
ACHIEVEMENT STANDARDS FOR STUDENTS WITH DISABILITIES .................................................... 269
RATIONALE ............................................................................................................................................................ 269
CHALLENGES IN EVALUATING COMPARABILITY .................................................................................................... 272
APPROACHES TO COMPARABILITY ......................................................................................................................... 274
Content and Construct Comparability between AA-MAS and General Assessments........................................ 275
Psychometric Comparability between AA-MAS and General Assessments ...................................................... 278
Scale and Score Comparability between AA-MAS and General Assessments .................................................. 284
Assessing the Level of Linguistic Complexity of the AA-MAS and General Assessment Test Items ................. 287
Basic Text Features .......................................................................................................................................... 290
Comparability in terms of Depth of Knowledge (DOK) ................................................................................... 291
Comparability issues in the accommodated assessments for students with disabilities .................................... 292
RECOMMENDATIONS TO NYSED FOR ESTABLISHING COMPARABILITY BETWEEN AA-MAS AND GENERAL
ASSESSMENTS ........................................................................................................................................................ 295
GUIDELINES FOR EXAMINING COMPARABILITY OF AA-MAS: HOW MUCH COMPARABILITY IS NECESSARY? ...... 297
Necessary Features for Establishing Comparability between AA-MAS and General Assessments .................. 298
Desired Features for Establishing Comparability Between AA-MAS and General assessments...................... 299
REFERENCES .......................................................................................................................................................... 301
CHAPTER 9: CONSTRUCTING A VALIDITY ARGUMENT FOR ALTERNATE ASSESSMENTS BASED
ON MODIFIED ACHIEVEMENT STANDARDS (AA-MAS) ........................................................................... 308
FRAMEWORK.......................................................................................................................................................... 309
Why an Argument? ............................................................................................................................................ 310
KANE’S ARGUMENT-BASED FRAMEWORK ............................................................................................................. 310
The Interpretative Argument ............................................................................................................................. 312
Values and Consequences ................................................................................................................................. 313
GUIDING PHILOSOPHY, PURPOSES, AND USES ........................................................................................................ 314
A THEORY OF ACTION: THE STARTING POINT FOR AN INTERPRETATIVE ARGUMENT ............................................ 316
Example # 1: The AA-MAS allows eligible students to show that what they know may be comparable to similar
levels on the general assessment. ...................................................................................................................... 317
Example #2: The AA-MAS will better align with current learning opportunities and beliefs about how eligible
special education students learn grade-level academic content. ...................................................................... 318
PRIORITIZING THE VALIDITY EVALUATION QUESTIONS ......................................................................................... 322
CLASSES AND SOURCES OF EVIDENCE ................................................................................................................... 323
Who Are the Students and How Do They Learn?.............................................................................................. 324
Evidence Based on Test Content ....................................................................................................................... 325
Evidence Based on Response Processes ........................................................................................................... 326
Internal Structure .............................................................................................................................................. 327
Evidence Based on Relations to Other Variables ............................................................................................. 328
Evidence Based on Consequences of Testing .................................................................................................... 329
SYNTHESIS AND EVALUATION................................................................................................................................ 330
Dynamic Evaluation.......................................................................................................................................... 331
REFERENCES .......................................................................................................................................................... 332
CHAPTER 10: OPERATIONAL AND ACCOUNTABILITY ISSUES ............................................................. 334
BACKGROUND AND CONTEXT FOR ACCOUNTABILITY............................................................................................ 334
Federal Regulations .......................................................................................................................................... 336
Relationship to Existing Assessments ............................................................................................................... 337
Grades and Content Areas for AA-MAS ........................................................................................................... 338
Participation Options and Evidence ................................................................................................................. 341
ACCOUNTABILITY SYSTEM BACKGROUND............................................................................................................. 345
Evaluating the Reliability of Accountability Determinations ............................................................................ 347
OPERATIONAL CONSIDERATIONS FOR NEW YORK STATE’S ACCOUNTABILITY SYSTEM ........................................ 350
Evaluating New York State’s Accountability Determinations ........................................................................... 354
Reporting .......................................................................................................................................................... 357
Diploma Eligibility ........................................................................................................................................... 358
CONCLUSION AND RECOMMENDATIONS ................................................................................................................ 360
REFERENCES .......................................................................................................................................................... 364
APPENDIX A: INDIVIDUALS INVOLVED IN THIS PROJECT .................................................................... 366
APPENDIX B: LIST OF INTERNET RESOURCES FOR EFFECTIVE CURRICULUM AND
INSTRUCTION ....................................................................................................................................................... 367
APPENDIX C: TOOL FOR STATE POLICYMAKERS.................................................................................... 369
GLOSSARY………………………………………………………………………………………………………...373
CHAPTER 1
INTRODUCTION
Marianne Perie
In April 2007, as part of its governance of the federal No Child Left Behind Act, the U.S.
Department of Education (USED) released new regulations that allowed for the use of an
alternate assessment based on modified achievement standards (AA-MAS). These regulations
supplemented the most recent Elementary and Secondary Education Act legislation regarding
the development of grade-level assessments and alternate assessments based on alternate
achievement standards. States were required to develop these assessments in reading and
math, grades 3–8 plus one grade in high school, and use those to hold schools and districts
accountable for student progress. States could use this new assessment for students with
disabilities to count up to two percent of students as Proficient for purposes of Adequate Yearly
Progress (AYP). These regulations were in response to state concerns that there were students
with disabilities who were not able to show proficiency on the general assessment and yet
would not be assessed appropriately by the alternate assessment based on alternate
achievement standards either; they fell into the ―gap‖ between the two assessments. It
supplemented the option of developing alternate assessments based on grade-level
achievement standards which only a few states used.
In spring 2008, six states submitted their ―modified‖ assessments for Peer Review to
determine whether they could be used for purposes of AYP. In June 2008, USED released a
paper written by Janet Filbin that describes the issues raised during the Peer Review of the six
state alternate assessments based on modified achievement standards. None of the states
received approval for their AA-MAS, but lessons learned from the review of their designs
provided much information for all states.
Considerations for an AA-MAS
Page 1
In Fall 2008, the New York Comprehensive Center (NYCC) applied for and received
supplemental funding from USED to collaborate with the New York State Department of
Education (NYSED) to study these issues further. Their proposal involved convening national
research experts to provide guidance to NYSED regarding the feasibility of developing an
alternate assessment based on modified achievement standards and advice on how to design
standards and assessments for students with disabilities who are part of the ―2%‖ gap. NYCC
partnered with the National Center for the Improvement of Educational Assessment (Center for
Assessment) to convene a panel of experts and write a white paper on this topic.
In January 2009, a group of 17 research experts were identified and brought together in
New York City to discuss the issues surrounding the design and development of an AA-MAS.
This report is a result of that meeting. Nine of the experts authored chapters of this report and
the remaining eight experts reviewed the chapters and provided support and guidance to the
authors. Each author is a nationally recognized expert on the issue discussed in his/her chapter
and the reviewers possess similar qualifications allowing for a full review both within and across
chapters. A full list of the experts involved in this study can be found in Appendix A.
The purpose of this report is to describe the primary challenges in developing an AAMAS based both on the Filbin (2008) paper and the panel‘s own experiences. It provides a
research-based analysis of the design and development issues and focuses on the theory
behind each issue. In addition, this report explores the existing research and best practices in
identifying and assessing these students. Specifically, the goal of this report is to make
recommendations to NYSED about developing an AA-MAS, with the expectation that these
recommendations could be generalized to other states. The authors each approached their
chapters with an intention to help states think through the issues, make appropriate decisions
regarding the allocation of resources, and ultimately improve opportunities for students with
disabilities.
Considerations for an AA-MAS
Page 2
Upon the completion of the second drafts of these chapters, the expert panel recognized
the utility of the information beyond the application to fulfilling the federal regulations regarding
an alternate assessment for purposes of accountability. Much of the discussion in this report
relates to instructing and assessment all low achievers and specifically on low achievers with
disabilities. Therefore, even if the regulations were rescinded, the panel believes the information
provided in this report will continue to be applicable as the field works to improve our knowledge
and understanding of how low achieving students with disabilities—and perhaps those without
disabilities who are also struggling with grade-level achievement standards—learn, organize
information, and communicate their understanding.
Background on Federal Regulations Guiding the AA-MAS
The 2001 reauthorization of the Elementary and Secondary Education Act (ESEA),
known as No Child Left Behind (NCLB), required that all states assess all students in reading1
and mathematics in grades 3 through 8 plus one grade in high school. In addition, they were
required to assess all students in science at least once in elementary, middle, and high school.
A minimum of three performance levels had to be developed for each test—one defining
proficiency, one above that and one below that—with the goal of all students reaching
proficiency by 2014. Up to 1% of students with the most significant cognitive disabilities could be
categorized as proficient using an alternate assessment based on alternate achievement
standards (AA-AAS). States also had the option of developing an alternate assessment based
on grade-level achievement standards (AA-GLAS) for those students who were capable of
performing on grade level but needed a format other than the traditional multiple-choice test to
demonstrate their knowledge and skills.
1
The law requires an assessment in reading, although some states include reading in a broader
English Language Arts (ELA) assessment and use that to meeting NCLB requirements.
Considerations for an AA-MAS
Page 3
Some state and local leaders argued that there were still some students with disabilities
who were not being well served by the assessment program because they were ineligible to
take the AA-AAS and unable to access all of the content and skills assessed on either gradelevel assessment. Prior to NCLB, many states used out-of-level testing to assess certain
students with disabilities. For example, a student who was in grade 8 based on their age, but
being instructed significantly below the 8th-grade level, might be administered a grade 6 test.
NCLB ended that practice and enforced the IDEA principle that students should have access to
the general curriculum by holding schools accountable for teaching students grade-level
content.
The regulations released in April 2007 allowed states to develop an alternate
assessment based on modified achievement standards and use it for accountability purposes.
Before these regulations were released, students with disabilities had the option of taking: (1) a
general grade-level assessment, with or without accommodations; (2) an alternate assessment
based on grade-level achievement standards; or (3) an alternate assessment based on
alternate achievement standards. Critics argued that none of these options seemed appropriate
for certain groups of students with disabilities. They wanted an additional option for an
appropriate assessment of what these students know and can do across all the content
standards not only for accountability purposes but also to provide information that could help
guide instruction. The AA-MAS was intended to fall between an AA-AAS and a general gradelevel assessment to provide a more appropriate measure of these students' performance
against academic content standards for the grade in which they are enrolled. The regulations
state that ―there is a small group of students whose disability has precluded them from
achieving grade-level proficiency and whose progress is such that they will not reach gradelevel proficiency in the same time frame as other students‖ (34 C.F.R. Part 200). However, this
statement has raised countless questions as state policymakers try to determine who this ―small
Considerations for an AA-MAS
Page 4
group‖ is within that larger group of students who are not eligible for the AA-AAS but who are
not performing well on the grade-level assessment.
An emphasis of the regulations and the nonregulatory guidance was that this
assessment must be challenging for these students. The assessments are required to cover the
same breadth and depth as the other grade-level assessments. Modified achievement
standards were described as being challenging for eligible students although defining a less
rigorous expectation of mastery of grade-level academic content standards. They could not be
linked to content from a lower grade level or exclude content standards that were assessed by
the grade-level general assessment. States also were not permitted to apply their new modified
achievement standards to that same general assessment; a new assessment must be
developed. Students assessed using the AA-MAS must have access to grade-level content and
be working towards achieving grade-level goals. However, it is important to note that the
regulations do not require states to develop this assessment and provide flexibility for states to
develop an AA-MAS only for a particular grade or subject.
Eligible students include students with a disability in any of the 13 disability categories
defined in the Individuals with Disabilities Education Act (IDEA). To determine eligibility, the
guidance stipulates:
There must be objective evidence demonstrating the student‘s disability has precluded
the student from achieving grade-level proficiency.
The student‘s progress to date in response to appropriate instruction, including special
education services designed to meet the individual needs of the student, is such that
even if significant growth occurs, the IEP team is reasonably certain the student will not
reach grade-level proficiency within the year covered by the IEP.
The student‘s IEP must include goals that are based on the academic content standards
for the grade in which the student is enrolled.
Considerations for an AA-MAS
Page 5
States must establish participation guidelines for IEP teams to use to match the student
to the appropriate test, typically the grade-level assessment with or without accommodations, an
AA-GLAS, an AA-MAS or an AA-AAS. The guidelines must include criteria based on evidence
that demonstrate the student meets the three eligibility requirements bulleted above. Students
should not be locked into taking an AA-MAS every year, but must have the opportunity to move
from the AA-MAS to a general or alternate grade-level assessment from one year to the next.
Also, a student might take the AA-MAS in one subject but the general assessment in another.
All of these decisions would be made each year by the student‘s IEP team.
The 2007 regulations allow states to count up to 2% of students as proficient using an
alternate assessment based on modified achievement standards. The number ―2%‖ was
considered to be a ―reasonable cap‖ based on the research available to the federal government.
While states were developing the AA-MAS, they were allowed to use a ―2% proxy‖ for interim
flexibility. That is, they could calculate the percentage of students with disabilities that is
equivalent to 2.0 percent of all students assessed in a particular school or district. This proxy
could then be added to the percentage of students with disabilities who scored proficient or
above and used in making AYP determinations. This interim flexibility was first introduced in
2005 when the announcement was made that the USED was working on regulations for the AAMAS, and it is set to expire after the 2008-09 accountability year. Using an AA-MAS, states
could count up to 2% of students as proficient, replacing the 2% proxy but still providing
additional flexibility for state, district, and school accountability.
Setting the Stage for this Report
A driving question for New York State (and other states) is whether the development of
this assessment will yield useful information to guide instruction and be cost effective. More
specifically, in times of budget cutbacks, how can the limited funding available be best allocated
to support the learning of these students? The first issue in answering this question is
Considerations for an AA-MAS
Page 6
determining who ―these students‖ are. Subsumed within that question is the possibility of
expanding this report beyond the current federal regulations, focusing on students who may not
be receiving grade-level content. In addition, it is important to consider the challenges of using
the data to ―guide instruction‖ when the primary focus of many of these assessments is simply
to provide an additional measure for purposes of accountability. While more description
regarding the students and the uses of the assessment will be provided more fully in the
following chapters, it is important to provide a context and lay out the assumptions that this
report will follow regarding fidelity to the federal regulations and guidance.
Authors were encouraged to adhere to the law laid out in the most recent IDEA and
ESEA reauthorization and to stay true to the federal principles. However, if there were aspects
of the April 2007 regulations permitting the development of the AA-MAS that authors found too
constrictive, they were encouraged to address areas for change. Recognizing that people
disagree on the assumptions behind this 2% population, this paper is written from the
assumption that all students can learn (and should be taught) grade-level content standards
with appropriate instruction and support. However the degree to which all students achieve the
grade-level content standards may vary. Of course, even these assumptions lead to more
questions about whether students are taught the exact same content or whether it will be
modified as well as the time frame in which they are expected to learn the content. These more
specific issues will be addressed in the first section of the report, but the authors started from
these basic principles and assumptions.
Issues Specific to New York State
In elementary and middle school, the New York State Education Department (NYSED)
requires NCLB testing for English and Mathematics in grades 3–8, and science assessment in
grades 4 and 8. In addition to the NCLB-required tests, New York State assesses social studies
Considerations for an AA-MAS
Page 7
in grades 5 and 8; and technology education in grade 8. All tests are comprised of both multiplechoice and constructed-response items. Student performance is divided into four levels:
Level I: ‗Not Meeting Learning Standards‘
Level II: ‗Partially Meeting Learning Standards‘
Level III: ‗Meeting Learning Standards‘
Level IV: ‗Meeting Learning Standards with Distinction‘
New York State counts Level III and IV as Proficient for purposes of AYP.
At high school, NYSED administers Regents Examinations that are tied to the high
school diploma a student receives. Currently, students are required to take Regents
Examinations in English, mathematics, social studies, and science. The diploma a student
receives is tied both to the courses taken and the score on the Regents Examinations.
Currently, a student with the most significant cognitive disabilities typically receives an IEP
diploma. Students who do not achieve the level of performance necessary to earn a Regents
diploma can earn a Local diploma; a score of 65 or higher qualifies a student for a Regent‘s
diploma; a score of 55-64 qualifies a student for a Local diploma. For purposes of AYP at the
high school level, the Regents Examinations in Comprehensive English and Integrated Algebra
are used. Instead of using the graduation cut scores, NYSED established three separate
achievement levels for Comprehensive English and Integrated Algebra to be used solely for the
purpose of calculating AYP. The Regents Examinations also are comprised of both multiplechoice and constructed-response items.
NYSED is primarily interested in exploring the AA-MAS in English and Mathematics. At
this point, it has received federal approval to develop AA-MAS only in grades 3–8; however, for
purposes of this report, authors were asked to consider the full range of K–12 assessments.
NYSED wants to follow the regulations of assessing the same content on the AA-MAS as on its
general assessment, to better understand how to make the assessment less rigorous, and to
Considerations for an AA-MAS
Page 8
learn how to modify the achievement standards while maintaining the reliability and validity of
the results. Specific questions raised by NYSED include:
1. Which students are best served by this assessment?
2. How different are they from the rest of the special education population?
3. What is an ―appropriately challenging‖ achievement standard?
4. Which modifications make the most sense in the context of the AA-MAS?
5. How do the modifications affect the validity and reliability of the interpretation?
6. What is the credential that is most appropriate for students participating in the AA-MAS
and what does it lead to in terms of post-secondary potential?
Most of the issues raised by NYSED are general issues that many state policymakers
are confronting, and many of these match closely with the issues raised by Filbin (2008). The
one exception is the last question regarding student credentialing. Because the Regents
examinations are used both for AYP purposes (thus open to modification) and graduation
requirements, NYSED raises a good question regarding whether using an AA-MAS would limit a
student‘s opportunity to receive a Regents diploma. The nonregulatory guidance clarifies that no
assumption is made about the comparability between the AA-MAS and an assessment required
for graduation. However, states may not require students to enter a non-diploma track if they
take an AA-MAS. However, since the diploma in New York State is based primarily on the score
obtained on a Regents exam, modifying that exam does seem to ensure the student will be
tracked to a lower-level diploma. This question is a policy issue rather than a technical question,
and while it will be addressed within this report, it is ultimately an issue that will need to be
decided by NYSED.
Organization of the Paper
The direction to the expert panel from the Assistant Commissioner of the Office of
Standards, Assessments, and Reporting in NYSED was to provide information on current
Considerations for an AA-MAS
Page 9
research and best practices and to make recommendations on the steps NYSED should take
towards designing an AA-MAS (or to recommend not to do it at all). As a first step, the expert
panelists reviewed Filbin (2008) to determine key issues. Filbin identified five areas that were
challenging to states:
1. Identifying students eligible to take the AA-MAS.
2. Providing guidelines for writing standards-based IEPs and then monitoring the
implementation of those guidelines.
3. Designing an assessment based on grade-level content standards that is of an
appropriate difficulty and depth of knowledge for this population.
4. Determining the relationship between the AA-MAS, the general assessment, and the
alternate assessment based on alternate achievement standards (AA-AAS).
5. Writing appropriate modified achievement level descriptors.
The expert panel used these five issues as a starting point during the initial planning
meeting. Later, the specific questions from NYSED were added and divided among the different
chapters as appropriate. Ultimately, though, this report was organized into three sections
focusing on different aspects of designing and developing the AA-MAS, with three chapters in
each section. Within the ten chapters (including this introduction), all of the issues described by
Filbin (2008) and the questions raised by NYSED are addressed.
Section I. Identifying and Understanding the Population. The first issue raised by Filbin
(2008) and asked by NYSED involves determining who should take this assessment. During the
initial expert panel meeting, the experts decided that the issues of identifying the students were
wrapped up in the NRC assessment triangle of assessment, instruction, and cognition
(Pellegrino, Chudowsky, & Glaser, 2001). Thus it was decided that this first section should
discuss the issues of identifying the students and understanding their cognitive abilities,
including the interaction between instruction, cognition, and assessment. This section could be
titled: who are the students, vis à vis the curriculum?
Considerations for an AA-MAS
Page 10
Chapter 2, written by Rachel Quenemoen, focuses on identifying students appropriate
for this assessment. She provides a policy context and summarizes research related to the
teaching and learning of students with disabilities. Most importantly, she lays out a framework
for state policymakers in considering how to identify students who might benefit most from an
alternate assessment based on modified achievement standards. Included in this framework is
a discussion on improving student access to grade-level curriculum and providing more
opportunities to learn.
Chapter 3, by Meagan Karvonen, takes this argument one step further by examining
various instructional strategies for teaching students with disabilities, with a focus on the issue
of writing standards-based IEPs. She discusses the importance of aligning the curriculum with
the grade-level content standards and providing supports for students to access this curriculum
within the IEPs. She describes ways to promote quality of instruction and provide guidance to
IEP teams.
Finally, in chapter 4, Jim Pellegrino provides information on the third vertex of the
triangle: student cognition. He discusses the importance of understanding student learning
characteristics and cognitive processes in assessment, and goes on to describe possible
sources of differences among students that have implications for learning, instruction, and
assessment.
Section II. Test Development. This next section starts the discussion on test
development. The main question the authors wrestled with was how to make the assessment
more accessible for students with a wide range of disabilities but maintain the reliability and
validity of the results. A deep understanding of the content and test design was necessary as
well as an understanding of what is meant by modified achievement standards.
Thus, chapter 5, written by Robert Rickelman and David Pugalee, begins this section
with a discussion of the content domains of reading and mathematics. They continue the
discussion from the first section regarding aligning curriculum, instruction, and assessment but
Considerations for an AA-MAS
Page 11
focus specifically on issues related to reading and mathematics. They describe important issues
regarding sampling the domain and making the content accessible to students with disabilities.
Chapter 6, by Stephen Dunbar and Catherine Welch, then moves the discussion into the
issues of test development. They discuss the challenge of developing items and test forms in
reading and mathematics that better match the learning characteristics of the population
identified for the AA-MAS, focusing on reducing the difficulty while maintaining the reliability of
the assessment.
Next, the issue of developing modified achievement level descriptors is discussed by
Marianne Perie in Chapter 7. This chapter focuses on determining how the modified
achievement standards ―fit‖ between the grade-level achievement standards and the alternate
achievement standards, and provides practical advice for writing achievement level descriptors
and setting cut scores, discussing the theory behind each.
Section III. Technical Considerations and Practical Applications. The third section of this
report addresses three issues related to the technical quality and use of the assessments:
examining the validity of these assessments, determining the comparability of these
assessments to the general assessment, and understanding how these assessments will be
operationalized and used in a state accountability system.
In chapter 8, Jamal Abedi explores issues of comparability of the AA-MAS with the
general assessments and grade-level achievement standards. The chapter is written from the
premise that issues concerning comparability of assessments are of paramount importance for
inclusion, as states may not produce valid outcomes if the degree of comparability across the
assessments has not been clearly explored and described. Likewise, descriptions of any
differences in interpretations of achievement levels of the same name across each type of
assessment need to be provided. This chapter examines content and construct comparability,
linguistic comparability, psychometric comparability, and the use of accommodations to achieve
comparability.
Considerations for an AA-MAS
Page 12
Chapter 9, by Scott Marion, focuses on creating a validity argument for alternate
assessments based on modified achievement standards. This chapter describes the importance
of stating the policymakers‘ values explicitly and laying out a theory of action for the purpose
and use of these assessments. It then goes on to describe types of validity evidence that can be
gathered throughout the test development process and beyond and used to evaluation the
assumptions in the validity argument.
Finally, in chapter 10, Chris Domaleski provides practical advice and a theoretical
discussion of using these assessments in state accountability systems. The focus of this
chapter is how the AA-MAS fits into state accountability systems, but specific advice is given on
how to develop participation guidelines, evaluate the reliability and validity of accountability
decisions made using these assessments, operationalize the ―2% cap,‖ create score reports,
and use the results to determine diploma eligibility.
At the end of this white paper are three appendices followed by a glossary of terms.
Appendix A simply provides information on the team that developed this white paper as the
chapters were shaped by the entire expert panel. Appendix B provides suggested resources,
available on the Internet, for effective curriculum and instruction. Appendix C is a tool that state
policymakers can use as they are considering whether and how to develop an AA-MAS. This
tool consists of questions for state policymakers and educators to consider at each phase of
assessment development as well as a link back to resources within this report that will inform
the discussions. Many of the questions come from the validity framework that guides much of
the discussion in these chapters (c.f., Marion, 2007). Finally, a glossary of terms is included that
encompasses vocabulary used in both the assessment and disabilities worlds.
Considerations for an AA-MAS
Page 13
References
Filbin, J. (2008). Lessons from the initial peer review of alternate assessments based on
modified achievement standards .Washington DC: U.S. Department of Education, Office
of Elementary and Secondary Education (OESE) Student Achievement and School
Accountability Program.
Marion, S. (2007). A technical design and documentation workbook for assessments based on
modified achievement standards. Minneapolis: National Center on Educational
Outcomes. Retrieved April 2009 from
http://cehd.umn.edu/nceo/Teleconferences/AAMASteleconferences/AAMASworkbook.pd
f
Pellegrino, J. W., Chudowsky, N. J., & Glaser, R. (Eds.) (2001). Knowing what students know:
The science and design of educational assessment. Washington, DC: National Academy
of Sciences.
U. S. Department of Education (2007a). Final Rule 34 CFR Parts 200 and 300: Title I—
Improving the academic achievement of the disadvantaged; Individuals with disabilities
education act (IDEA). Federal Register. 72(67), Washington DC: Author. Retrieved July
12, 2008, from the World Wide Web: http://cehd.umn.edu/NCEO/2percentReg/FederalRegApril9TwoPercent.pdf
U. S. Department of Education (2007b). Modified achievement standards: Non-regulatory
guidance. Washington DC: Office of Elementary and Secondary Education (OESE).
Retrieved from the World Wide Web on August 27, 2008, at
http://www.ed.gov/policy/speced/guid/nclb/twopercent.doc
Considerations for an AA-MAS
Page 14
SECTION I
IDENTIFYING AND UNDERSTANDING THE POPULATION AND THEIR
CURRICULUM
The first challenge is to determine who the students are who are in need of a new
assessment. The focus is on students with disabilities who are not achieving proficiency on
grade-level standards and who do not appear to be making significant progress towards
achieving that proficiency. But beyond that, it is important to explore various aspects of these
students, including the nature of their disability and why it might hinder learning. And, we would
be remiss not to explore the issues of curriculum and instruction to see whether opportunity to
learn is having a larger impact on performance than the nature of the disability.
These three chapters tie together these ideas to describe the population in terms of who
they are, necessary elements of their instruction, and how they learn. More specifically, each
chapter delves into different theories involving the fluidity of this population. Chapter 2, by
Rachel Quenemoen, focuses on the notion of the least dangerous assumption by considering
exclusionary criteria. It provides a history of regulations regarding students with disabilities and
discusses applications of the current regulations to the school environment.
In Chapter 3, Meagan Karvonen focuses on procedural integrity by providing an
overview of standards-based individualized education programs (IEPs) and describing how to
promote improved opportunities to learn the standards-based curriculum with specialized
instruction, services, and supports based on individual student learning characteristics. Then,
Chapter 4, by Jim Pellegrino, helps us understand explanatory constructs by discussing the
broader understanding of student cognition, describing possible sources of differences among
students that have implications for learning, instruction, and assessment. It provides information
on student cognition and explores issues related to barriers to learning for low achievers.
Considerations for an AA-MAS
Page 15
Understanding who the students are, what and how they are taught, and identifying any
barriers to learning is a first step towards understanding how best to assess what they know and
can do. This section provides the backbone for the later sections on test development and
technical issues.
This section is stronger because of the insightful comments of the expert panel members
who reviewed these chapters. In particular, comments from Claudia Flowers, Gerald Tindal,
Brian Gong, and Suzanne Lane were incorporated into these chapters.
Considerations for an AA-MAS
Page 16
CHAPTER 2
IDENTIFYING STUDENTS AND CONSIDERING WHY AND WHETHER TO ASSESS
THEM WITH AN ALTERNATE ASSESSMENT BASED ON MODIFIED ACHIEVEMENT
STANDARDS
Rachel Quenemoen
This chapter presents some of the complex issues that need to be considered when
identifying students who may benefit from participating in an alternate assessment based on
modified achievement standards (AA-MAS). The chapter starts with a historical perspective of
the regulatory language creating AA-MAS. It informs readers of the initial rationale for creating
AA-MAS and the concerns of advocacy groups. The preliminary requirements for identifying
students who may be eligible for participation in AA-MAS are also discussed. Research findings
about low-performing students are introduced to provide the readers with an understanding that
low-performing students are students with and without disabilities. These findings also illustrate
that identifying students who will benefit from participating in AA-MAS requires much more than
knowing the students‘ previous large-scale test performance or the students‘ disability category.
While student characteristics are important, they are only part of the consideration for
determining AA-MAS eligibility. A discussion of teacher perceptions of student characteristics
and opportunity-to-learn issues is followed by potential best-practice interventions and
instructional practices (i.e., Response to Intervention (RtI), progress monitoring) to provide
readers with information on strategies that may benefit all students. It is followed by an
examination of policy assumptions about instructional and curricular strategies used with
students and how they relate to assessment choices. Ultimately states will make policy
decisions that will define students who may participate in AA-MAS. These policy decisions are
framed with a discussion of social justice, guiding philosophy, and coherence of the overall
instruction, curriculum, and assessment systems. The chapter ends with a set of questions for
Considerations for an AA-MAS
Page 17
states to answer as they consider their options and potentially develop and implement an AAMAS.
The chapter is written in the context of the imperatives of a system accountability model
since the AA-MAS was conceptualized initially as an option within such a model. That is, this
chapter assumes that an AA-MAS is a key component of a policy that is designed to improve
student achievement and to narrow achievement gaps that have affected certain groups of
students differentially over time. The underlying policy assumes that poor performance by
students on the state assessment will result in consequences for schools and districts that will
motivate educators to provide better services to students, services that will enable them to learn
and achieve to proficiency. The path to these improvements is through improved instruction and
curriculum, although that implication seems to be missing in many discussions about
assessments used for system accountability. Because of this essential but sometimes neglected
component of system accountability, examples of standards-based instruction and curriculum
strategies and interventions are included in this chapter (as well as the next, Karvonen, Chapter
3, this volume) to augment this volume‘s focus on the assessment component of the policy
imperative. Possible validity-related questions regarding the relationship of high quality
standards-based instruction and curriculum to achievement of students with disabilities on
standards-based assessments are posed in the concluding section, and examples of studies
that have uncovered these relationships are cited (e.g., Barr, Telfer, & DiMuzio, 2009; Cortiella
& Burnette, 2007; Donahue Institute, 2005).
History of Assessment Options Related to AA-MAS
In order to understand which students may meet the requirements for participation in an
AA-MAS, it is important to begin with a review of the policy discussion that framed the initial
regulation. There was immediate and intense debate surrounding the announcement of the
Considerations for an AA-MAS
Page 18
proposed regulation, primarily focused on the research USED cited as the rationale. This debate
continues as states study whether or how they will implement these assessments.
In April 2005, addressing a group of chief state school officers and other officials,
Secretary of Education Margaret Spellings announced new flexibility in assessing students with
disabilities under No Child Left Behind (NCLB) regulations. Secretary Spellings called it a
"workable, sensible approach that was based on scientific research,‖ permitting states to
develop and use modified assessments for students with ―persistent academic disabilities.‖
These students were defined as those ―who need more time and instruction to make substantial
progress toward grade-level achievement.‖ The research base cited was summarized and sent
to all chief state school officers. These materials began with a reference to the earlier (2003)
NCLB regulation permitting alternate achievement standards for students with significant
cognitive disabilities, defined first in a notice of proposed rules in 2002, and finalized in 2003.
The 0.5% cap originally included in the August 2002 proposed regulation [1%] was
based on data outlining the prevalence rates of students with the most significant
cognitive disabilities. It was tied to a definition of such students which: 1) excluded
students with mild mental retardation and other students who were two or fewer
standard deviations below the mean, and 2) included students with intellectual
functioning and adaptive behavior three or more standard deviations below the
mean. When this rule was finalized, the Department expanded the cap to 1.0% to
allow States and districts more flexibility in its implementation and removed the
definition from the regulation. However, research conducted and reviewed by Reid
Lyon at National Institute for Child Health and Human Development and Jack
Fletcher at the University of Texas indicates that the 1.0% cap is, in fact, too low, if
the Department follows the definition currently provided in the December 2003
regulation's preamble (a student in one of 13 disability categories who cannot reach
grade-level standards, even with the best instruction possible) (USED, 2005).
Considerations for an AA-MAS
Page 19
The USED provided a summary of research that supported this increase in students who
could participate in alternate assessments against less challenging achievement standards. In
the research summary, Lyon and Fletcher found that ―the best-designed instructional
interventions achieved a range of success from a low of 50% to a high of 90% of participating
students reaching grade-level reading standards.‖ They concluded, that the ―totality of this
research suggests that there are about 1.8% to 2.5% of children who are not able to reach
grade-level standards, even with the best instruction‖ (USED, 2005).
Advocates for students with disabilities responded to the proposed new flexibility with
concern. Central to their concern was the fear that students who participate in an assessment
based on a lower standard will also receive instruction in a lower curriculum. The implication
was that the option of modified achievement standards would limit struggling students‘ access to
academic instruction and needed research-based interventions to accelerate learning and, over
time, preclude their attainment of a standard diploma. One advocacy group, the National
Council for Learning Disabilities (NCLD), critiqued the research base summarized by Lyon and
Fletcher for USED, noting that although the research on effective reading interventions cited
was important for remediation and for new methods for identification of learning disabilities,
these reading intervention studies did not support the federal policy changes proposed in the
new regulation (Wendorf, 2005). More recently, NCLD has concluded: ―The studies that were
originally used by the U.S. Department of Education in 2005 to justify the 20 percent number
were flawed. In fact, in one of the major studies cited to justify the new policy, only 11 percent of
the students were special education students and the additional studies cited did not include
any special education students‖ (Kaloi, 2007).
A number of the requirements in the final regulation reflect compromises that resulted
from this vigorous debate (e.g., students must have standards-based IEP goals; students are
not precluded from earning a standard high school diploma), but the controversy related to use
of modified achievement standards—and implications of use for student achievement—remains.
Considerations for an AA-MAS
Page 20
Given the controversy, state policymakers need to grapple with whether they believe this
is a distinct group, separate from both the group of students defined in the ―1%‖ regulation as
having significant cognitive disabilities and from other students with disabilities. State
policymakers, educators, and advocates had far less difficulty in coming to consensus on the
appropriateness of alternate assessments based on alternate achievement standards (AA-AAS)
as a pathway to higher expectations and achievement. In many ways, the students referenced
in the 1% regulation were, by and large, unarguably a distinct group, albeit a heterogeneous
group of students with unique characteristics, many with complex disabilities. However, in
implementation, some students may be inappropriately included in AA-AAS instead of a more
challenging assessment. Historical low expectations affect decision-making, as do past
performance patterns of students who have not been taught the content to be assessed.
Students who are inappropriately included in an AA-AAS may be harmed by assumptions that
they cannot learn the full range of grade-level academic content when the result is that they will
never be taught that content, leading to a self-fulfilling prophecy.
As quoted from the USED (2005) 2% materials, the earlier 1% figure was established as
a compromise; the estimates of how many students may have the most severe intellectual and
multiple disabilities initially (.5%) was supported by data in states that report moderate and
severe mental retardation as separate from all students with mental retardation and from
Centers for Disease Control data on incidence of correlated disability diagnoses. Thus, the 1%
cap on inclusion of scores from AA-AAS as proficient for AYP calculations incorporates some
flexibility already, but from a policy perspective to balance the flexibility, the cap was intended to
prevent inappropriate inclusion of too many students in a different achievement expectation.
The controversies of AA-AAS tend to be about the nature of the content being taught and
assessed and the technical issues related to test design, but not whether different achievement
standard(s) could be an appropriately high expectation for students with significant cognitive
disabilities.
Considerations for an AA-MAS
Page 21
In contrast, there is limited consensus on students referenced in the 2% regulation.
Some policymakers reference students who are ―just above‖ the students who participate in AAAAS, with an achievement expectation far below the grade-level expectation, perhaps adding to
the students already included in AA-AAS through the flexibility of the compromise 1% cap.
Others suggest that these students are those who ―just miss‖ the proficiency determination on
the general assessment, and the modified achievement standard should be ―just below‖ the
grade-level achievement standard. Yet another interpretation is that students with disabilities
perform on a continuum with no defined borders; thus, additional achievement standards may
be necessary in order to count more students as proficient against multiple standards on a
sliding scale set. The regulatory language and research base referenced in the regulation are
not clear about the target population for AA-MAS.
Ultimately, state-defined modified achievement standards should be a policy statement
of what is an appropriately high expectation for some state-defined group of students, an
expectation that should improve their achievement and outcomes in order to be consistent with
the letter and the spirit of NCLB and IDEA. It is essential that state policymakers articulate who
the target students are and how they build competence in the academic domains tested prior to
deciding whether and how to develop an assessment based on modified achievement
standards. Then, decisions about the design of the assessment itself can adhere to the policy
imperatives, instead of the assessment choices inadvertently shaping the policy outcomes.
Who Are the Students? The Complexity of Regulatory Requirements
The regulation specifies two primary requirements for participation in an AA-MAS. The
student must be identified as having a disability that precludes attainment of grade-level
achievement standards within the current year; and the student must have an IEP that
references grade-level content.
Considerations for an AA-MAS
Page 22
Disability Categories Overview: Limitations of Categorical Designations to Predict Attainment of
Standards
As cited in the research summary underlying the regulation, students with disabilities are
defined as having a primary disability label under 13 categories. The learning characteristics of
students who are assigned to these categories vary greatly both among and within the
categories. See Figure 2-1 for a summary of categorical distribution of students with disabilities,
by primary disability.
Figure 2-1. Distribution of Primary Disability Categories
*
*
Percentages in this figure are based on a total number of 6,007,832 students receiving special education
services (www.ideadata.org, 2007) counted under primary disability only.
**
Developmental delay is applicable only to children ages 3 through 9.
Based on the 2007 IDEA Part B Child Count data in the United States and outlying areas
(www.ideadata.org, 2007), 43.6% of students received special education services for specific
learning disabilities (see Figure 2-1). The next largest disability group is speech or language
Considerations for an AA-MAS
Page 23
impairments totaling 19.2%, followed by the category of students with other health impairments
at 10.5%, mental retardation at 8.3%, and emotional disturbance at 7.3%. Students with autism
make up 4.3% of students served in special education, and students with multiple disabilities
make up 2.2% of these students. Smaller categories of students in special education include
students with developmental delay at 1.5% (category for ages 3-9 only), hearing impairments at
1.2%, and orthopedic impairments at 1.0%. Students with visual impairments and traumatic
brain injury each make up 0.4% of students served by special education, while the remaining
0.02% of students make up the deaf-blindness category.
The criteria used to determine student eligibility for special education and related
services under IDEA are defined by each state (within certain federal parameters), and thus the
criteria vary from state to state. Although a few categories include criteria that are relatively
objective (e.g., vision, hearing, and physical characteristics), all categories include more
subjective judgment as well. Some categories (e.g., specific learning disabilities and mental
retardation) have widely varying criteria, subject to interpretation in multiple ways. It is common
within and across states to have the criteria yield students who have very different learning
characteristics sharing the same label and students who are very similar to one another having
different categorical labels. The National Association of School Psychologists (2002) position
statement on categorical labels summarizes research on categorical labels thus:
State by state and district by district variations exist in the definition and criteria for
specific disability conditions, despite the common language of IDEA and its regulations;
Particularly among the more subjective, ―mild‖ disability categories of Specific Learning
Disability, Mental Retardation, Emotional Disturbance, and Speech/Language
Impairment, labeled students show significant overlap in skills and receive highly similar
instruction;
Among students with very low achievement, there are no consistent distinctions between
those identified as disabled (e.g., SLD) and those who are considered ―Slow Learners‖;
Considerations for an AA-MAS
Page 24
Regardless of instructional needs, students tend to be placed in programs based on
labels; and
Despite increased opportunities for inclusion in the general education program, students
who are labeled as having a disability are less likely to have general education friends,
are less likely to have instructional goals tied to the general education curriculum, are
more likely to drop out of school and have lower rates of successful adult outcomes
(NASP, 2002).
These complexities related to disability categorical labels and criteria for determining
eligibility for special education services affect some groups of students more than others. For
example, English Language Learners (ELLs) are disproportionately identified as also having
disabilities in many states, and in many states, ELLs with disabilities are the lowest performing
group overall. The challenges of determining whether a student is eligible for special education
services in the context of limited English proficiency (or culture, ethnic, or socioeconomic status)
is additional reason to use caution in assuming that special education categorical labels are
useful for purposes of identifying students for assessment options. See Artiles, A. J., & Ortiz,
2002; Abedi, 2006; Abedi, 2007; Minnema, Thurlow, Anderson, & Stone, 2005 for more
information on these learners.
Still, some general comparisons can be made about the nature of the disability
categories. For example, the relationship of categorical label to students‘ ability to learn is
described by disabilities expert Martha Thurlow, director of the National Center on Educational
Outcomes (NCEO), as follows:
Most students with disabilities (75 percent altogether) have learning disabilities,
speech/language impairments, other health impairments, and emotional/ behavioral
disabilities. These students, along with those who have physical, visual, and hearing
impairments (another 4–5 percent), are all students without intellectual impairments.
When given appropriate accommodations, services, supports, and specialized
Considerations for an AA-MAS
Page 25
instruction, these students (totaling over 80 percent of students with disabilities) can
learn the grade-level content in the general education curriculum, and thus achieve
proficiency on the grade-level content standards. In addition, research suggests that
many of the small percent of students with disabilities who have intellectual
impairments (i.e., generally includes students in categories of mental retardation,
developmental delay, some with multiple disabilities, some with autism), totaling less
than 2 percent of the total student population, or about 20 percent of all students with
disabilities, can also achieve proficiency when they receive high quality instruction in
the grade-level content, appropriate services and supports, and appropriate
accommodations. (Thurlow, 2007)
The reality that many students with disabilities currently do not achieve at proficiency raises
questions of whether or not they are all receiving the required high quality instruction in the
grade-level content, appropriate services and supports, and appropriate accommodations.
Beyond that type of general observation across disability categories, it is difficult to
define how the specific categorical labels differentiate how students learn and demonstrate their
learning, or how, as required in the regulation, this disability prevents them from attainment of
grade-level achievement within the current year.
IEPs and Access to the General Curriculum
The second primary requirement for states to meet in designing participation criteria for
the AA-MAS (in addition to being identified as eligible to receive special education services
under IDEA) is the requirement that the student must have an IEP that references grade-level
content. In some ways, this requirement is redundant of the foundational requirements of being
eligible to receive special education services. That is, having access to the curriculum in order
to meet the educational standards of the public agency is how special education is defined in
IDEA: “specially designed instruction, at no cost to parents, to meet the unique needs of a child
Considerations for an AA-MAS
Page 26
with a disability, including instruction conducted in the classroom, in the home, in hospitals and
institutions, and in other settings …” [20 U.S.C. §1401 (29)]. Specially-designed instruction is
defined as “adapting, as appropriate to the child’s needs, the content, methodology, or delivery
of instruction to address the unique needs of the child that result from the child’s disability; to
ensure access of the child to the general education curriculum, so that the child can meet the
educational standards within the jurisdiction of the public agency that apply to all children‖ [34
CFR §300.39 (b)(3)].
These definitions are not new to IDEA 2004. In 1999, disability rights attorney Paul
Weckstein wrote, ―An IEP that sets lower goals and does not focus on these standards [that is,
the educational standards of the public agency] is usually illegal. Nor is it generally legal to
assign a student with disabilities to a low-track regular program that does not teach to these
standards. These rights are protected by the federal IDEA and Section 504 of the Rehabilitation
Act of 1973‖ (Weckstein, 1999, p. 314). These rights were reinforced through NCLB in 2001 and
the reauthorization of IDEA in 2004, according to Karger and Boundy (2008):
These two statutes [NCLB 2001 and IDEA 2004] together have lifted expectations for
learning and have underscored the legally enforceable rights of students with
disabilities to be effectively taught by highly qualified teachers, to be provided an
opportunity to learn to the same high standards as their peers without disabilities,
and to be included in all state and district-wide assessments. Protections provided
under Section 504 of the Rehabilitation Act of 1973 (Section 504) and the Americans
with Disabilities Act (ADA) and grounded in the Fourteenth Amendment of the U.S.
Constitution also ensure that students with disabilities are not subject to
discriminatory policies and practices. Rather, these students must be provided
meaningful opportunities to learn the knowledge and skills necessary to attain
proficiency on their respective state standards; full and fair opportunities to
demonstrate their level of mastery of state standards through participation in
Considerations for an AA-MAS
Page 27
appropriate assessments used to improve their instruction and learning; and equal
opportunities to be counted in the publicly reported data system that is used to hold
schools, districts, and states accountable for the academic performance of all
students. (p. 11)
Even though IDEA has required that all students who receive special education services
should be provided the services, supports, and specialized instruction so that they achieve
proficiency on the state standards, many students with disabilities have not been receiving that
instruction. In 1984, special education researcher Anne Donnellan wrote that ―the criterion of
least dangerous assumption holds that in the absence of conclusive data, educational decisions
ought to be based on assumptions which, if incorrect, will have the least dangerous effect on the
likelihood that students will be able to function independently as adults.‖ (p. 142). She
concluded that barring proof to the contrary, educators need to assume that poor performance
is due to instructional deficits instead of student deficits. The regulation requires that IEP teams
examine objective evidence demonstrating that the student‘s disability has precluded the
student from achieving proficiency and the guidance suggests that ―Such evidence may include
the student‘s performance on State assessments or other assessments that can validly
document academic achievement‖ (USED Non-regulatory Guidance, 2007, p. 17). IEP teams
will have to determine that such evidence is sufficient to ensure that instructional deficits are not
the cause of low performance, as opposed to the student‘s disability. Using data from largescale assessments of content that the student has not been taught seems to result in circular
logic for this purpose. That is, documenting student academic achievement on content that has
not been taught by an assessment of that content tells us nothing about whether or how the
student‘s disability precluded their achievement, and only tells us what the student knows and
can do prior to instruction. Later in this chapter and in the next chapter (Karvonen), there are
discussions of methods for documenting the effectiveness of instructional and curricular
Considerations for an AA-MAS
Page 28
strategies that could be used to ensure that instructional deficits are not the cause of low
performance.
The specification in the regulation that students who participate in an AA-MAS must
have standards-based IEPs was meant to assuage the concerns raised by advocates that a
less challenging achievement standard would result in further inappropriate instruction in a
lower level track. Whether or not the stipulation was in the regulation, states have an obligation
to make sure that all students with disabilities are receiving the services, supports, and
specialized instruction necessary for them to make progress in—and achieve proficiency in—the
curriculum based on the state standards defined for all students. Unless there is assurance that
has occurred, the least dangerous assumption is that poor performance relates to poor
opportunities to learn. See Karvonen, chapter 3, this volume for more information on how the
IEP process can be used to improve opportunities to learn.
Operationalizing the Regulatory Language: Identifying Who May Be Eligible
Since the AA-MAS regulation was finalized, states have struggled to identify the
students who are low performing and might be eligible for this assessment. Additionally, a few
researchers have attempted to understand what opportunities to learn the low-performing
students have had. Given the initial controversies about the research base, the limited utility of
categorical labels, the necessity of ensuring access to the general curriculum, and the lack of
agreement in the field of who these students are, these studies have been challenging. There
are debates about whether states should identify the students based on a percentage (i.e., the
2% of students with disabilities with the lowest scores), or whether studies should be based on
the characteristics of the population and their instruction in the content defined in the regulation
(i.e., those taught the curriculum but not likely to be proficient that year). Most studies have
taken the former approach, mining state assessment data for students who perform at the
lowest end, although data from opportunity-to-learn investigations are increasingly challenging
Considerations for an AA-MAS
Page 29
the assumptions that all low- performing students have indeed been taught the standards-based
curriculum.
Studies of Low-Performing Students
States have tried to operationalize who the eligible students are in varying ways
(Fincher, 2007; HB Study Group from Colorado, 2005; Marion, Gong, & Simpson, 2006; New
England Compact, 2007). One of the first investigations looking at how students with disabilities
currently perform on large-scale assessments under NCLB was done by researchers at the
National Center for the Improvement of Educational Assessment (NCIEA). As indicated in
Figure 2-2, the scores of students with disabilities occur at all scale scores in the distribution
and the scores of students without disabilities also occur at all scale scores in the distribution
(Marion, Gong, & Simpson, 2006).
Figure 2-2: Grade 4 Mathematics Scale Scores by Special Education Status
Lowest 5% of Scores to Left of Vertical Line
Reprinted by Permission
This study foreshadowed results of studies in multiple states: the lowest performing
students on state assessments under NCLB are not all students with disabilities. For example,
Perie (2009) summarized data mining approaches in two states, Georgia and South Carolina.
Considerations for an AA-MAS
Page 30
Georgia mined data from three years of the state test, identifying persistent low performers in
grades 5 and 8 as students with three years of data scoring in the lowest of three achievement
levels. South Carolina looked at grades 4 and 7, identifying students with two years of data
scoring in the lowest of four levels. In Georgia, the percentage of persistently low-performing
students who have documented disabilities ranged from 40% to 55%; in South Carolina, the
percentages of students with disabilities among the lowest performers ranged from 39% to 49%.
(Perie, 2009). These data will vary based on the nature of states‘ proficiency standards and
depending on the numbers of years for which data are available. That is, two years of data may
show different patterns than four years of data. The studies vary in methodology and findings,
and many of the research reports are included in the accompanying resources list. Chapter 10
(Domaleski, this volume) addresses ways states can learn from these efforts to design their own
data-mining study.
Student Characteristics and Opportunity-to-Learn Investigations
Several states have attempted to understand more about the educational characteristics
of these low-performing students. Perie (2009) summarized results of teacher surveys and
focus groups in several states that captured teacher perceptions about the nature of these
students‘ learning. These findings of teacher perceptions indicate: ―that the core academic
curriculum is significantly modified or specifically designed for the student; the student is making
fairly consistent progress but not at expected (or targeted) level; there is a gap between actual
performance and targeted level of performance which is evident over a period of time (at least 2
consecutive years); the gap continues to widen or remains the same; despite the provision of
―good‖ interventions, the student is not progressing at the rate expected for grade level;
accommodations alone do not allow the student to fully demonstrate knowledge; and all
appropriate accommodations have been exhausted.‖
Considerations for an AA-MAS
Page 31
Georgia has conducted a curriculum implementation survey of teachers that asked
whether low-performing students were receiving instruction in the grade-level curriculum and to
what depth and degree. Teachers self-reported their instructional practices and curriculum
choices; the investigators suggest that teacher interviews and direct classroom observations
would be preferable but not feasible with resources available. Although the results of the study
are not yet published, their initial findings suggest that at fifth and eighth grade in mathematics
and at eighth grade in reading, general education teachers have higher expectations (deeper
levels of understanding) for students than do special education teachers, but fifth grade reading
responses showed higher expectations among special education teachers (Fincher, 2009).
Qualitative data are being used to help illuminate these differences, but results are not yet
available.
The New England Compact study of the gap included interviews with teachers, and
substantial direct quotations from teachers. Compare and contrast the following teacher
observations of similar students (Parker & Saxon, 2007):
Teacher 1: They tend to be slow learners. They tend to be ―shady 80s,‖ that is what I
call them. Seventy to 75 makes you mentally retarded or learning impaired. If you are
in the 90s, then you are okay. These are shady 80s. They show up every day for
school, and they sit down and crank out their little homework. They don‘t have a clue
what the homework means, but they have it done. They always have a notebook.
They always have a sharpened pencil . . . These guys are good students in the
classroom. They have their notebook and their pencil. Clueless. They have no
mechanism to practice it . . . They never move from that very pretend area of
teaching. They do it fairly well. These kids are going to get 70s. These kids will get it
right, but they don‘t have a clue how they got it right. It never becomes theirs. (p. 7)
Teacher 2: My teaching practice just in the past year has changed dramatically. My
thrust now is to really concentrate on eighth grade GLEs [grade-level expectations],
Considerations for an AA-MAS
Page 32
even though most kids I have in my resource room are third grade level for math,
maybe fourth grade . . . I found math strategies presented on the third, fourth grade
level . . . they were exposed to strategy on their level, so then we worked through
problems up to the eighth grade level. So that‘s a new direction for me, because I‘m
not sure I‘ve always had the expectation that they could do eighth grade math. So
my expectation has changed, and my teaching practice as a result of that has
changed. (p. 8)
Clearly, as evidenced by the first teacher, not all educators are implementing the least
dangerous assumption related to their expectations for these students. In some cases,
refocusing attention on the needs of struggling learners changes teacher behavior (Teacher 2),
but that is not always the case (Teacher 1).
At the request of the Colorado State Legislature, a Colorado study group reviewed their
reading and math data (grades 3–10) and found that not all of the students performing in the
lowest one-third on the state tests were students receiving special education; accommodations
were not consistently provided to all eligible students; and they saw ―substantial longitudinal
growth‖ toward grade level achievement for the majority of the students over time. When the
study group initiated actual observations of instructional opportunities, they found that students
(with and without disabilities) who were making the greatest gains toward grade-level
achievement were those attending schools that provided ―intensive, targeted, research-based
instruction‖ (HB Study Group, 2005). The next section elaborates on possible strategies to
improve opportunities to learn and to identify students who may benefit from participation in AAMAS.
Strategies to Improve Opportunities to Learn and Identify Students Who May Benefit
from AA-MAS
There are studies focused on what is occurring in schools where students with
disabilities are performing well (e.g., Barr, Telfer, & DiMuzio, 2009; Cortiella & Burnette, 2007;
Considerations for an AA-MAS
Page 33
Donahue Institute, 2005). These studies consistently identify common characteristics among
schools where students with disabilities achieve at high levels. As summarized in one study, the
schools have: (1) a pervasive emphasis on the curriculum and alignment with the standards, (2)
effective systems to support curriculum alignment, (3) an emphasis on inclusion and access to
the curriculum, (4) a culture and practices that support high standards and student achievement,
(5) a well-disciplined academic and social environment, (6) continuous use of student data to
inform decision making, (7) unified practices supported by targeted professional development,
(8) access to resources to support key initiatives, (9) effective staff recruitment, retention, and
deployment, (10) flexible leaders and staff that work effectively in a dynamic environment, and
(11) effective leadership (Donahue Institute, 2004).
A recommendation from the Colorado study of students in the gap was that the state
should implement sound ―data-driven recommendations that focus on student learning and on
valid measurement of that learning.‖ The implementation of Response to Intervention (RtI)
strategies in many states is meant to ensure that research-based early literacy screening and
early intervention processes are used to help identify struggling learners as soon as possible.
According to the National Center on Response to Intervention, ―Response to intervention
integrates assessment and intervention within a multi-level prevention system to maximize
student achievement and to reduce behavior problems. With RtI, schools identify students at
risk for poor learning outcomes, monitor student progress, provide evidence-based interventions
and adjust the intensity and nature of those interventions depending on a student‘s
responsiveness, and identify students with learning disabilities.‖ (See
http://www.rti4success.org). In addition, RtI can help identify students with learning disabilities
sooner so they can achieve more. This approach is not limited to students with disabilities and
can assist states like Colorado that have a commitment to improve instruction and outcomes for
all students identified as low-performing, with and without disabilities. As such, RtI holds
Considerations for an AA-MAS
Page 34
promise for states that choose to emphasize intervention on instruction and curriculum as
opposed to relying solely on large-scale assessments to improve achievement and outcomes.
There is strong research and best practices documentation for RtI in reading, especially
in the primary grades. The Institute of Education Sciences (IES) has published a practice guide
based on currently available evidence, explicating five recommendations for implementing RtI to
identify students in need of intervention. It further describes how to carry out each
recommendation and identifies potential roadblocks to implementation. (Gersten, Compton,
Connor, Dimino, Santoro, Linan-Thompson, & Tilly, 2008). The authors note that while multi-tier
efforts like RtI can prevent learners from falling behind through early implementation of
interventions, they also note that ―some aspects of RtI, however, (such as tier 1 instruction) are
still poorly defined, and there is little evidence that some practices of targeted instruction will be
effective‖ (p. 8). Still, they suggest that a coordinated multi-tier program can prevent beginning
readers from becoming struggling readers in the later grades, and possibly prevent referrals to
special education. They provide an exhaustive references list to support their recommendations,
and categorize the recommendations based on the strength of the evidence in the research
base. (Gersten et al., 2008).
Fuchs and Fuchs (2006) also provide helpful guidance to states and districts considering
use of RtI related to reading. They document differences among educators about the
appropriate use of RtI, noting ―The first group views RtI mostly in terms of providing prevention
and advocates for more tiers. The second group regards RtI mostly as identification and
classification procedures and argues for fewer tiers‖ (p. 94). In addition, they suggest that
practitioners and researchers vary in their preference for application of RtI, with practitioners
viewing RtI as a problem-solving approach while researchers favor the use of standard
treatment protocols. They identify numerous unanswered questions and unresolved issues,
including the challenge of false positive identification for special education services in use with a
problem solving approach and false negative identification in use with a standard treatment
Considerations for an AA-MAS
Page 35
protocol approach (Fuchs & Fuchs, 2006). They question which error is worse, but do not
answer the question. Ultimately, that becomes a critical policy decision to be made at the local
and state levels in implementation, and something that must be monitored closely.
Fletcher (2008) summarizes issues for consideration that may serve to inform efforts in
states to grapple with defining students who meet the regulatory requirements for participation
in an AA-MAS. He notes that RtI is not appropriate solely as a special education initiative or
method to meet criteria for identification of learning disabilities; it ultimately is a regular
education initiative for all students. Once students are identified as meeting criteria for special
education services based on not responding well to interventions, Fletcher suggests that does
not give us the information needed to understand why they are not responding or how to teach
or assess them. Echoes of advocates‘ concerns about the strength of the research base under
the regulation seem to be borne out in his conclusion that ―more research is needed on the
characteristics of students who do not respond well to intervention since we have not really had
the opportunity to study this subgroup from cognitive, interventional, and neurobiological
perspectives‖ (p. 9). The regular education basis for RtI also has potential benefit for states that
feel an obligation to intervene on behalf of all low-performing students who do not have
disabilities. RtI processes have the potential to improve outcomes for all struggling students, not
just those with disabilities.
Other progress monitoring approaches also may contribute to improved achievement for
all students. Curriculum-based measurement (CBM) is of particular interest given the strong
research base for many CBM methods. Recently, Fuchs, Seethaler, Fuchs, and Hamlett (2008)
proposed CBMs as meeting the requirements for determining eligibility under the 2% regulation:
―That is, CBM progress monitoring can be used to provide the necessary database on (a)
whether grade-level proficiency is expected, (b) whether appropriate instruction has been
provided, and (c) whether progress in response to that instruction is appropriate. Moreover, as it
satisfies the three-pronged requirement for evidence related to identifying the 2% population,
Considerations for an AA-MAS
Page 36
CBM progress monitoring simultaneously provides the added advantage of helping schools
enhance special education outcomes‖ (p. 160). Given that CBMs often are used within an RtI
framework to monitor progress following an intervention, these authors suggest that efficiencies
of scale will emerge that produce data to determine eligibility for an AA-MAS.
Placing these approaches within the context of standards set for the grade level, as
required in current standards-based systems, is essential. Deno, Fuchs, Marston, and Shin
(2001) discuss the implications of a normative approach to setting achievement expectations
that assume that the current typically observed growth rates for students with disabilities are
reasonable and predict future growth. These assumptions lead to the conclusion that students
with learning disabilities will learn at a slower rate than typical peers. The researchers speculate
that this kind of reasoning reflects the "well-accepted fact that special education, as typically
practiced in this country, fails to regularly incorporate demonstrably effective methods" (Deno et
al., 2001, p. 515). If this is true, then systemic interventions on the system of special education
in this country to correct these deficits should be a priority in every state, as opposed to
accepting lower rates of student learning.
Response to intervention and other progress monitoring approaches hold much promise
for improved outcomes and higher expectations. Still, critical contextual challenges must be
addressed. These challenges affect the implementation of effective progress monitoring for
students with disabilities. They include historical limited access to challenging standards-based
curriculum, instruction, and assessment; concerns about the target of measurement, that is,
whether only basic skills or a full range of rich and challenging grade-level content should be
measured; and limited practitioner understanding about use of data for effective provision of
instructional strategies, interventions, and supports in a standards-based system (Quenemoen,
Thurlow, Moen, Thompson, & Morse, 2003). In most schools, the path to assurance of the least
dangerous assumption requires guideposts of continuing staff development, support, and
oversight.
Considerations for an AA-MAS
Page 37
Even with the identification (and improvement) of opportunities to learn through
instructional quality and curriculum access, the validity of assessments hinges on whether all
students who have learned the grade-level content can show what they know on the
assessments. The National Accessible Reading Assessment Projects (NARAP) are conducting
a program of research and development designed to make large-scale assessments of reading
proficiency more accessible for students who have disabilities that affect reading. They suggest
that creating ―accessible reading assessments based on accepted definitions of reading and
proficiencies of reading requires knowledge of the issues specific to each disability and how
they affect reading and the assessment of reading.‖ They have prepared a series of papers by
categorical label to serve as discussion guides for partners working on test development,
summarized in Thurlow, Moen, Liu, Scullin, Hausmann, and Shyyan (2009). This may serve as
a resource to state stakeholders and consultants as they grapple with how to first teach and
then assess students with varying characteristics. Still, there are limits to what discussions
based on categorical labels will yield, given the well-documented subjectivity of state-defined
eligibility criteria used to determine categorical labels and the heterogeneity of students within
each category.
Regulatory Options, Policy Prerogatives, and Implications
State policymakers have a great deal of flexibility in decisions about whether and how to
implement AA-MAS. They also have a responsibility to articulate thoughtfully the philosophy and
beliefs that these decisions reflect. This thoughtful decision-making yields two important tests of
whether the choices made are sustainable. The first test is whether students who have
historically been underserved and ill-prepared for adult life will see improved outcomes. The
second is whether the technical defensibility of an AA-MAS rests in validity arguments that
begin in these decisions and play out in each step of assessment design. This second test is
discussed throughout this volume in all the chapters; the first test begins with the choices on
Considerations for an AA-MAS
Page 38
who participates in AA-MAS and the effect of that participation on their access to high quality
standards-based instruction and curriculum, and ultimately, their academic achievement.
Key Political and Social Justice Issues
The test of whether students will see improved outcomes must be considered as the
initial decisions are made. Public articulation and discussion of these decisions can ensure that
historically low expectations and opportunities to learn are not reinforced. In the spirit and the
letter of both NCLB and IDEA, implementation of AA-MAS should expressly raise expectations
and result in higher achievement for students who participate in the option. In the current
standards-based accountability-driven reform model, any option that encourages or rewards
less challenging standards for any student who could achieve at grade level (assuming they
have access to the curriculum and are instructed effectively) undermines the entire system of
school reform. For most students, with and without disabilities, we cannot predict with any
accuracy whether they will achieve at grade level when instructed effectively. Thus, the least
dangerous assumption requires that all students receive that instruction. Careful monitoring of
consequences of the system is essential to ensure the intended positive consequences are
occurring and unintended negative consequences are not.
Data-mining of current student performance such as that described in Chapter 10
(Domaleski, this volume) can shape the dialogue among stakeholders. Key questions to
consider include: What evidence exists to suggest that students with disabilities who are lowperforming differ from minority students or poor students who are low-performing? What
evidence exists to support the policy assumption that some of these students cannot achieve at
grade level even if their opportunity to learn (OTL) is appropriate? What do direct observations
of student instructional and curricular opportunities tell us, and how does that compare and
contrast to teacher perceptions? What evidence exists to support the notion that OTL is
appropriate for all students in the subgroup? States need to articulate the belief systems that
underlie their policy decisions and justify these beliefs based on data. There is a fundamental
Considerations for an AA-MAS
Page 39
tension within the regulatory options. If the consequences of participating in an AA-MAS are
positive, how can we deny the same opportunities to other students with similar achievement
profiles other than disability status? If the consequences are negative, how can we justify the
use of the options for students with disabilities?
Varying State Assumptions, Philosophies and Beliefs, and Their Implications
As states articulate their philosophy and beliefs related to students who may participate
in AA-MAS, they are constructing the foundation of their assessment choices. Examples of
underlying philosophies and beliefs, which yield very different definitions of eligible students,
assessment options, and eventual student achievement are below. These are very different
positions; most states participating in public discussions can categorize their philosophy as
more or less similar to one of these three options, at least one of which does not seem to match
the AA-MAS regulatory requirements. Each of these philosophical positions has infinite
variations, many of them controversial, but the dominant profile of each is presented here for
general consideration.
Guiding Philosophy 1: Student performance based on past test results and
instructional opportunities predict appropriate expectations for future performance.
There is a group of students with disabilities who are being instructed in a modified
curriculum, one linked to the general curriculum as required by IDEA but different
from their enrolled-grade peers. Thus, the state needs to provide a modified
assessment to match the curriculum. Given that many of these students will never
catch up to their enrolled-grade peers, this will result in very different long-term
outcomes. Thus, policy choices focus on modifying the assessments to match a
modified curriculum. The target population for inclusion in these modified
assessments includes students who may be participating in a modified curriculum.
This philosophy does not appear to match the requirements of the AA-MAS, but may
Considerations for an AA-MAS
Page 40
result in an additional assessment option under AA-AAS flexibility in some states. As
such, it is a “nonexample” of a philosophical foundation for AA-MAS.
Implications of Guiding Philosophy 1: The assumption that there is a group of students
with disabilities who participate in a modified curriculum different from their typical peers or
development of an assessment option that does not result in a standard diploma would raise
peer review and advocacy concerns under the requirements of the 2% regulation. Instead, this
assumption better fits a very small group of students who may now participate in AA-AAS but for
whom the current alternate achievement standards are not an appropriately high expectation. In
other words, these are students who currently may be topping out on the AA-AAS or performing
at the very lowest levels of the general assessment. There generally are very few students in
most states who fit into this group, once opportunity-to-learn barriers are taken into account. An
assessment built under this assumption would better match the requirements for a new, more
challenging AA-AAS than the requirements of an AA-MAS. The 1% regulation permits states to
set more than one alternate achievement standard. Some states do not have a full 1% of
students participating in the AA-AAS, while other states find that although they have 1% or
higher participation in AA-AAS, they recognize that for some students who participate, it does
not reflect a sufficiently high expectation. Given the assumptions of separate track curriculum,
assessment, and outcomes assumed in this first philosophy, development of a second AA-AAS
with higher performance expectations than on the existing AA-AAS may be warranted.
Guiding Philosophy 2: Student achievement needs to be considered in the context of
systematic opportunities to learn (services, supports, specialized instruction, etc.) the
general curriculum based on grade-level content and achievement standards. For
students who do not as yet have these opportunities, intervention on their
opportunities to learn is the first priority in order to accelerate their learning. Given
that it will take some time for these students to regain any lost ground from previous
limited OTL, short term AA-MAS options are an appropriate interim measure to hold
Considerations for an AA-MAS
Page 41
schools accountable for their learning. These AA-MAS should cover substantially the
same content as the general assessment, but may have more content coverage at
the lower end of the grade-level content expectations. Still, the state policy is based
on an assumption that eventually, with strong interventions and evidence-based
practices in student services, supports, and specialized instruction, these students
will achieve the same outcomes as typical peers.
Implications of Guiding Philosophy 2: This philosophy seems to match the language of
the 2% regulation, although it is still challenging to conceptualize how to use AA-MAS as interim
accountability measures with the expectation that students will catch up over time. (See
Domaleski, Chapter 10, this volume for examples of how this may affect design of the
assessment system.) Those students who are identified through state data-mining efforts as
making gains each year but who are still below proficiency over several test administrations may
be the target population under this theory. Response-to-Intervention and progress monitoring
tools should be in place to get better data to understand their needs and to plan improved
services supports, and specialized instruction to ensure that they do, indeed, catch up over
time. A state working under this philosophy may consider requiring data from multiple sources
(e.g., RtI, progress monitoring, interim assessments, etc.) to document a decision about a
student‘s participation in AA-MAS each year, and also require documentation of the evidencebased practices that have been implemented for the student based on these data. This
evidence of opportunities could be included in requirements under the accountability system in
addition to the assessment data themselves. They may also decide to implement an AA-MAS
only at selected grades, and instead promote and support through training and other resources
district/school efforts like RtI and progress monitoring to prevent students from falling behind.
States and districts may need to design intensive training and coaching efforts to ensure
teachers have the skills necessary to effectively teach these students. Alternatively, a state may
require that students exit from the AA-MAS in a set number of years, or sunset the AA-MAS
Considerations for an AA-MAS
Page 42
completely, depending on what data reveal about whether the combined practice of increased
OTL and selected use of AA-MAS support improved achievement for individuals and for the
subgroup or not over the longer term.
Guiding Philosophy 3: Student characteristics (related to disability, ethnicity, poverty,
and other demographic categories) have been associated with a history of low
expectations and limited opportunities to learn. Until we intervene to change learning
opportunities, we cannot use past performance to predict which students could
achieve to proficiency. This is true for students with disabilities and for those without
disabilities. Designing a new assessment option now jeopardizes an opportunity for
reform to ensure all students are taught well. The state policy must ensure students
are being taught first, then see who is left achieving at low levels once all are taught
well before building any assessment options that may risk perpetuating lowered
expectations and outcomes.
Implications of Guiding Philosophy 3: This approach assumes that the state will provide
leadership and resources to intervene on historical lack of opportunities to learn for all the
students described as the low performers in state data sets discussed earlier in this chapter.
Certainly systematic interventions like RtI and progress monitoring, new interim or formative
assessment options, and continued efforts in states to implement programs like Positive
Behavioral Supports (PBS) contribute to this effort. These efforts are difficult to take to scale,
but are laudable and necessary. They are not, however, sufficient to ensure that all students
who historically have had limited opportunities to learn and continued low expectations for their
performance achieve at higher levels. States that voice this philosophy will look for ways to
intervene on attitudes and beliefs of educators and the public who still find it appropriate to
expect less of students simply because they have a disability label, or who are poor or of
minority status. They will support these educators with increased training and coaching on
Considerations for an AA-MAS
Page 43
evidence-based methods to effectively instruct all students, with and without disabilities, in the
challenging standards-based curriculum.
This philosophy results in a choice not to develop an AA-MAS at this time. Yet, the state
systematically is addressing a key component of the accountability system, that of leveraging
improvements in standards-based instruction and curricular access for all students, especially
those students who have historically been underserved, including those with disabilities. As
such, this philosophy highlights the choices states have to make on how to most effectively use
limited resources to ensure the best possible outcomes. The costs associated with these
decisions include opportunity costs of those choices not made.
Another philosophical stance, outside the parameters of the current accountability
system: There is another philosophy underlying assessment choices in states that is not
included above, since it does not assume that participation in these assessments should lead to
systematic improvement in instruction and curriculum for low-performing students and ultimately
to improved student achievement, as is assumed in the current accountability model. That set of
beliefs is focused around the need to provide what is often described as ―relief‖ to districts and
schools from the consequences of system accountability for students with disabilities. If this
philosophy dominates state discussions, then much of the guidance in this volume will be
overdesigned for state purposes in developing such an assessment. A more straightforward and
less expensive method to support this option is to exempt up to 2% of students from the
assessment and accountability, but then the state leaders would be responsible for supporting
the implications of that decision, both under Federal compliance requirements and from
advocacy groups interested in protecting the rights of students with disabilities.
Relationship of AA-MAS to the State Options on Accommodations and Alternate Assessments
Based on Grade-level Achievement Standards
Regardless of the underlying philosophy, states will have to define how the AA-MAS
contributes to an overall system of assessment for accountability purposes. That is, there
Considerations for an AA-MAS
Page 44
should be a coherent educational logic or relationship between the AA-MAS and other alternate
assessments (based on AAS or GLAS) as well as with the general assessment. Universal
design, accommodations policies, decision-making guidelines, training, and monitoring should
support the validity of the assessment system so that all students are included in ways that
support use of assessment results in a standards-based accountability system.
Use of accommodations on a standards-based assessment assumes that careful
consideration is given to whether the grade-level content and achievement standards being
measured remain constant despite the use of the accommodation. The collective knowledge
base on the effects of accommodations on the content being measured is growing, but there are
considerable complexities in the case of the most challenging content and student
combinations. This can result in students who have learned the skills and knowledge assessed
on the test scoring below proficiency. Some students with disabilities may have barriers to
showing what they know on the state general assessment, even with accommodations. These
barriers may be related to several different disability characteristics (e.g., processing, sensory,
physical, or emotional barriers). In order to ensure the validity of inferences of the assessment
system for these students, states may need to consider an alternate assessment based on
grade-level achievement standards (AA-GLAS).
Perie (2009) posed key questions that states will have to answer related to the
relationship among the state assessment options: ―Do we expect to see smooth transitions from
one assessment to the next? How do the performance expectations relate? Is Proficient on the
AA-MAS similar in nature to Proficient on the general assessment? Is it closer to Basic? Or is it
somewhere in between? Is there an expectation that the AA-MAS may provide a stepping stone
for students to reach Proficient on the general assessment? Or, is the expectation that students
taking the AA-MAS are a unique population that will always need the modifications provided? Is
a student who scores Advanced on the AA-MAS prepared to take the general assessment or an
AA-GLAS or are they simply exceeding the criterion on their own assessment?‖
Considerations for an AA-MAS
Page 45
Standards-based assessment systems should include strategies that permit all students
to show what they know and can do on the academic content standards defined for typical
peers of the same age and grade level, despite the barriers of disability. However, any change
in academic achievement standards for a group of students should be reviewed to ensure that
these options raise the bar of academic expectations, and thus increase system accountability
for the outcomes of students who may participate in the option. The foundation for decisions
about assessment options must rest and be defended based on a publicly articulated set of
beliefs about teaching, learning, and eventual outcomes for all students.
Questions for States as They Choose, Build, and Defend the Validity of Their Approach
States must build an argument to defend the validity of their approach to AA-MAS that
begins with articulated definitions of who the students are who will benefit from the AA-MAS,
and why. The types of questions states must answer in their validity argument need to be
framed while they are making the decision to develop an AA-MAS, during development of an
AA-MAS, and in their continuous improvement and consequential validity studies as they
implement their AA-MAS. Related to identification of the students, these include questions about
the students who may participate; the instruction, support, and resources provided to these
students at the local level and the quality of IEP decisions made about their participation; and
about how the AA-MAS is appropriate for its articulated purposes and uses—including those of
improving achievement for students who participate.
Later chapters in this volume (especially Marion, Chapter 9; Domaleski, Chapter 10)
focus on the validity argument and on practical implications of state choices. It is important to
include questions about how a state can validate that the appropriate students are identified and
participate in an AA-MAS. This chapter is written under the assumption that implementation of
AA-MAS should expressly raise expectations and result in higher achievement for students who
participate in the option. That is a testable assumption, and should be considered as a state
Considerations for an AA-MAS
Page 46
makes choices and implements an AA-MAS. Questions about the appropriateness of AA-MAS
for improving student outcomes include (adapted from Marion, 2007):
How does this assessment provide a more accurate measure of the knowledge and
skills of the participants compared with the general assessment?
How does development of an AA-MAS yield better inferences about the students than
other assessment approaches, such as improved general assessment design,
appropriate accommodations, or development of AA-GLAS?
What are the potential costs and benefits of competing uses of resources, including
targeted staff development on instructional and curricular interventions for teachers of
struggling learners instead of assessment development and implementation?
How will the inclusion of the AA-MAS as part of the state‘s assessment system lead to
better instructional and curricular opportunities for these participating students?
Other questions identified by policymakers and stakeholders.
Marion (2007) has identified potential sources of data for many of these questions in the
form of a workbook for AA-MAS development and documentation. Design of validity studies
should begin while planning for any assessment option, and tools such as this volume and
earlier work done by Marion and others can guide those designs.
In conclusion, in order to make decisions about whether and how to design an AA-MAS,
states need to articulate a guiding philosophy that defines which students will benefit from
participating in the assessment, and how they will benefit. States then can design the
assessment based on the specific learning characteristics and opportunities to learn of students
who may participate. Over the next several years, states will need to work in partnership with
researchers and experts in curriculum, instruction, assessment, and disability issues to better
understand and identify the appropriate students for participation in the AA-MAS. Ultimately, the
goal of this work must be to understand how these students can build and demonstrate the skills
and knowledge they need to earn a standard diploma and to succeed in adult life.
Considerations for an AA-MAS
Page 47
References
Abedi, J. (2007). English language learners with disabilities. In Cahlan-Laitusis, C. & Cook, L.
Accommodating student with disabilities on state assessments: What works? (Ed.)
Arlington, Council for Exceptional Children.
Abedi, J. (2006). Psychometric Issues in the ELL Assessment and Special Education Eligibility.
Teacher’s College Record, Vol. 108, No. 11, 2282-2303.
Artiles, A. J., & Ortiz, A. (Eds.). (2002). English language learners with special needs:
Identification, placement, and instruction. Washington, DC: Center for Applied
Linguistics.
Barr, S., Telfer, D., & DiMuzio, M. (2009). The Ohio Improvement Process (OIP) as a strategy
for creating a viable SSOS for all Ohio districts and schools. Ohio Department of
Education presentation to USED Intradepartmental Work Group, February 4, 2009.
Cortiella, C., & Burnette, J. (2007). Challenging change: How schools and districts are
improving the performance of special education students. New York: National Council for
Learning Disabilities.
Deno, S., Fuchs, L., Marston, D., & Shin, J. (2001). Using curriculum-based measurement to
establish growth standards for students with disabilities. School Psychology Review,
30(4), 466-472.
Donahue Institute. (2004, Oct). A study of MCAS achievement and promising practices in urban
special education: Report of field research findings (Case studies and cross-case
analysis of promising practices in selected urban public school districts in
Massachusetts). Hadley, MA: University of Massachusetts, Donahue Institute, Research
and Evaluation Group. Retrieved March, 2006, from
http://www.donahue.umassp.edu/docs/?item_id=12699
Donnellan, A. (1984). The criterion of the least dangerous assumption. Behavior Disorders, 9,
141-150.
Fincher, M. (2007). ―Investigating the academic achievement of persistently low performing
students‖ in the session on Assessing (and Teaching) Students at Risk for Failure: A
Partnership for Success at the Council of Chief State School Officers Large Scale
Assessment Conference, Nashville TN, June 17-20, 2007. Retrieved August, 2007, from
http://www.ccsso.org/content/PDFs/12%2DMelissa%20Fincher%20Paul%20Ban%20Pa
m%20Rogers%20Rachel%20Quenemoen.pdf.
Fincher, M. (2009). Personal communication with the author on April 16, 2009.
Fletcher, J. ( 2008). Identifying learning disabilities in the context of response to intervention: A
hybrid model. Retrieved March 2009, from
http://www.rtinetwork.org/index2.php?option=com_content&task=view&id=331&pop=1&p
age=0&Itemid=45
Fuchs, D., & Fuchs, L. S. (2006). Introduction to Response to Intervention: What, why, and how
valid is it? Reading Research Quarterly, January- March. 93-99.
Considerations for an AA-MAS
Page 48
Fuchs, L., Seethaler P. M., Fuchs, D., & Hamlett, C. L. (2008). Using curriculum-based
measurement to identify the 2% population. Journal of Disability Policy Studies. I9(3),
151-161.
Gersten, R., Compton, D., Connor, C. M., Dimino, J., Santoro, L., Linan-Thompson, S., & Tilly,
W. D. (2008). Assisting students struggling with reading: Response to Intervention and
multi-tier intervention for reading in the primary grades. A practice guide. (NCEE 20094045). Washington, DC: National Center for Education Evaluation and Regional
Assistance, Institute of Education Sciences, U.S. Department of Education. Retrieved
April, 2009, from http://ies.ed.gov/ncee/wwc/publications/practice guides/ .
HB 05-1246 Study Committee (2005, December 31). Assessing “students in the gap” in
Colorado. Retrieved May, 2006, from
http://education.umn.edu/nceo/Teleconferences/tele11/ColoradoStudy.pdf.
IDEA Part B Child Count. (2007). Students ages 6 through 21 served under IDEA, Part B, by
disability category and state: Fall 2007 (Tables 1-3). Retrieved September 2008, from
www.IDEAdata.org
Kaloi, L. (2007). A Misunderstood Policy with the Potential to Harm Millions of Kids. Retrieved
March 8, 2009, from http://www.ncld.org/content/view/1279/311/
Karger, J., & Boundy, K. (2008). Including Students with Dyslexia in the State Accountability
System: The Basic Legal Framework, Perspectives on Language and Literacy
International Dyslexia Association, 34(4), Fall, 2008.
Marion, S. (2007). A technical design and documentation workbook for assessments based on
modified achievement standards. Minneapolis: National Center on Educational
Outcomes. Retrieved April, 2009, from
http://cehd.umn.edu/nceo/Teleconferences/AAMASteleconferences/AAMASworkbook.pd
f
Marion, S., Gong, B., & Simpson, M. A. (2006, Feb. 6). Mining achievement data to guide
policies and practices on assessment options. Teleconference on Making Good
Decisions on NCLB Flexibility Options. Minneapolis: National Center on Educational
Outcomes. Retrieved April, 2009, from
http://education.umn.edu/nceo/Teleconferences/tele11/default.html
Minnema, J., Thurlow, M., Anderson, M., & Stone, K. (2005). English language learners with
disabilities and large-scale assessments: What the literature can tell us (ELLs with
Disabilities Report 6). Minneapolis, MN: University of Minnesota, National Center on
Educational Outcomes.
National Association of School Psychologists (NASP). (2002). Position statement: Rights
without labels. Original statement adopted by NASP Delegate Assembly in 1986.
Revision adopted by NASP Delegate Assembly, July 14, 2002. Retrieved March 10,
2009, from http://www.nasponline.org/about_nasp/pospaper_rwl.aspx
National Center on Educational Outcomes. (2008). Previous studies that examine who the
students are who may qualify to participate in an AA-MAS. White paper for Expert Panel
Considerations for an AA-MAS
Page 49
Meeting Multi-State GSEG Consortium toward a Defensible AA-MAS, Minneapolis, MN:
March 10, 2008.
New England Compact. (2007). Reaching students in the gaps; A study of assessment gaps,
students, and alternatives. Newton MA: CAST, The Education Alliance, EDC, INTASC,
Measured Progress.
Parker, C. E., & Saxon, S. (2007). Teacher views of students in the gaps. New England
Compact Enhanced Assessment Grant. Retrieved March 9, 2009, from
http://www.necompact.org/Teacher_views_of_students.pdf
Perie, M. (2009). Understanding the AA-MAS: How does it fit into a state assessment and
accountability system? Presentation to CCSSO SCASS meeting, February 4, 2009.
Available at www.nciea.org.
Quenemoen, R., Thurlow, M., Moen, R., Thompson, S., & Morse, A. B. (2003). Progress
monitoring in an inclusive standards-based assessment and accountability system
(Synthesis Report 53). Minneapolis, MN: University of Minnesota, National Center on
Educational Outcomes.
Thurlow, M. (2007). The challenge of special populations to accountability for all. In D. Clark
(Ed.), No child left behind: A five year review (Congressional Program, Vol. 22, No. 1)
(pp. 39-44). Washington, DC: The Aspen Institute.
Thurlow, M. L., Moen, R. E. Liu, K. K., Scullin, S., Hausmann, K. E., & Shyyan, V. (2009).
Disabilities and reading: Understanding the effects of disabilities and their relationship to
reading instruction and assessment. University of Minnesota: Partnership for Accessible
Reading Assessment.
United States Department of Education. (2005). Raising achievement: Alternate assessments
for students with disabilities. Retrieved March 8, 2009, from
http://www.ed.gov/print/policy/elsec/guid/raising/alt-assess-long.html.
USED. (2007). Nonregulatory Guidance Modified Academic Achievement Standards.
Weckstein, P. (1999). School reform and enforceable rights to quality education. In J. Heubert
(Ed.), Law and school reform: Six strategies for promoting educational equity. (pp. 306389). New Haven: Yale University Press.
Wendorf, J. (2005). Open letter to Secretary of Education Margaret Spellings, March 27, 2005.
Retrieved March 8, 2009, from http://www.ncld.org/content/view/289/
Considerations for an AA-MAS
Page 50
CHAPTER 3
DEVELOPING STANDARDS-BASED IEPS THAT PROMOTE EFFECTIVE INSTRUCTION
Meagan Karvonen
States have the option to create Alternate Assessments based on Modified Achievement
Standards (AA-MAS) for students with disabilities who perform persistently and significantly
below grade level. This population is heterogeneous within states, and may also be defined
differently across states (Perie, 2009; Quenemoen, chapter 2, this volume). Regardless of their
characteristics and needs, this population of students requires extensive supports and effective
instruction in order to meet high expectations and transition back into eligibility for grade-level
assessment.
The importance of effective instruction for this target population of students was
recognized in the final regulations on the AA-MAS [U.S. Department of Education [USED],
2007a, 34 CFR 200.1(f)(iii)]. There are three alternate assessment options under NCLB: (a)
those based on grade-level achievement standards (AA-GLAS), (b) those based on alternate
achievement standards (AA-AAS), and (c) AA-MAS. While access to instruction based on
grade-level academic content standards is recognized for all students under IDEA, the
regulations for AA-MAS represent the first explicit assumption under NCLB that instruction has
afforded the student maximum opportunity to learn what is assessed. Opportunity to learn
requires a curriculum that is well-aligned to state standards and assessment so students can
show what they know and can do. For the population of AA-MAS-eligible students, a wellaligned curriculum is a critical foundation. Since these students have not been successful with
previous classroom instruction, they will also need for this curriculum to be designed and
delivered using the best, evidence-based instructional practices available.
Considerations for an AA-MAS
Page 51
The nonregulatory guidance on AA-MAS (USED, 2007b) reinforces the LEA‘s
responsibility to design a highly effective curriculum and instruction and requires a standardsbased individualized education program, or IEP:
The primary reason for requiring IEP goals based on grade-level academic content
standards is to ensure that students who participate in an assessment based on
modified academic achievement standards receive instruction in grade-level content
so that they can make progress towards meeting grade-level proficiency (p. 28).
In other words, the IEP is a way of driving the student‘s academic curriculum toward the
goal of transitioning back into grade-level assessments. The IEP also provides the evidence for
how the instructional program will incorporate supports to address the student‘s characteristics
stemming from the disability.
While states may differ in how they plan to approach AA-MAS or educate their students
who may be eligible for this assessment, there are federal requirements related to the contents
of IEPs that are universal across states. The IEP cannot completely capture the entire academic
curriculum for a student eligible to take AA-MAS. However, planning and writing the IEP can
help educators think systematically about how to design high-quality instruction for this
population. The purpose of this chapter is to describe how a standards-based IEP can support a
well-designed educational program that ensures access to grade-level curriculum using
effective instructional practices. After a description of the IEP as a document, some principles
are offered for effective instruction for this population of students. Next is a description of how
the IEP can promote good instruction, followed by suggestions for how states can provide
guidance to IEP teams that are responsible for creating standards-based IEPs for this
population of students. Requirements for state-level monitoring of IEP systems are also
reviewed. The chapter ends with a section on validity evidence related to curriculum and
instruction, and general conclusions. While most of this chapter is written with the intent to
promote ideal and potential best practices, there are still some areas in which current, realistic
Considerations for an AA-MAS
Page 52
practices have been challenged to reach the optimal, ―best practice.‖ The conclusion section of
this chapter acknowledges some of these challenges. Along with the other two chapters in this
section, which discuss possible characteristics of the target population including how they learn,
this chapter sets the stage for the remaining chapters on designing and implementing an AAMAS.
An Overview of the IEP
The Individualized Education Program, or IEP, is written at least annually for each
student with a documented disability. Although there are several required components, the
general idea is to consider the student‘s present levels of performance and documented needs
and strengths in order to create a comprehensive plan for the student‘s priorities that year.
Those priorities do include academics, but also reflect other supports that are essential to
provide the student with meaningful instruction given the features of his or her disability that
make access more challenging. By specifying these supports, students can more fully
participate and be successful in the pursuit of their educational goals. The IEP is written and its
contents agreed upon by a team that includes one or more special education teachers, a
general education teacher, other educational professionals (e.g., speech/language therapists,
counselors), the student‘s parents or guardians, and in some cases the student as well.
IEPs have been part of special education services since 1975. However, they have not
always played a central role in describing the academic curriculum. Karger (2004) reviewed the
historical literature on IEPs and noted a variety of problems including a disconnect between
IEPs and the curriculum, poor congruence across sections within the IEP (e.g., between
documented needs and annual goals), and special educator perceptions that the IEP was
irrelevant to instruction. In the early years, IEPs documented special education services that ran
parallel to general education (Ahearn, 2006). In the 1990s, IEP teams began determining that
students would spend more time in general education settings, often for nonacademic activities
Considerations for an AA-MAS
Page 53
(e.g., music classes, lunchtime in the cafeteria). This trend toward inclusion in general education
settings gave students with disabilities more access to the school building, but did not give full
access to the academic curriculum that was taught to students without disabilities. What special
educators now refer to as ―access to the general curriculum‖ was mandated in the purpose of
special education as written in the Individuals with Disabilities Education Improvement Act
(IDEA) of 1997:
To address the unique needs of the child that result from the child's disability; and to
ensure access of the child to the general curriculum, so that he or she can meet the
educational standards within the jurisdiction of the public agency that apply to all
children (34 C.F.R. § 300.26(b)(3))
With the enactment of IDEA 1997, IEPs were required to address the student‘s present
levels of performance, include annual goals and short-term objectives to help the student
progress in the general curriculum, and document program modifications and supports the
student would need in order to progress in the general curriculum. Thus, IEP teams were first
required to address general curriculum access just over a decade ago.
The next reauthorization of IDEA came after NCLB, in 2004. Under IDEA 2004, IEPs
must now contain the following elements:
A statement of the child's present levels of academic achievement and functional
performance
A statement of measurable annual goals, including academic and functional goals
designed to (a) meet the child's needs that result from the child's disability to enable the
child to be involved in and make progress in the general education curriculum; and (b)
meet each of the child's other educational needs that result from the child's disability;
Considerations for an AA-MAS
Page 54
A statement of the special education and related services and supplementary aids and
services, based on peer-reviewed research to the extent practicable, to be provided to
the child, or on behalf of the child;
A statement of any individual appropriate accommodations that are necessary to
measure the academic achievement and functional performance of the child on State
and districtwide assessments consistent with section 612(a)(16) of the Act; and if the
IEP Team determines that the child must take an alternate assessment instead of a
particular regular State or districtwide assessment of student achievement, a statement
of why the child cannot participate in the general assessment and why the particular
alternate assessment selected is appropriate for the child….(34 CFR §§ 300.320 300.324)
In designing the IEP, teams must consider the student‘s strengths, parents‘ concerns,
results of the most recent evaluation, and ―academic, developmental, and functional needs‖ of
the child [34 CFR §300.324(a)(1)(i-iv)]. In addition, teams must decide whether certain special
factors must be considered in planning the educational program for the student:
1. In the case of a child whose behavior impedes the child's learning or that of others,
consider the use of positive behavioral interventions and supports, and other strategies,
to address that behavior;
2. In the case of a child with limited English proficiency, consider the language needs of the
child as those needs relate to the child's IEP;
3. In the case of a child who is blind or visually impaired, provide for instruction in Braille
and the use of Braille unless the IEP Team determines, after an evaluation of the child's
reading and writing skills, needs, and appropriate reading and writing media (including
an evaluation of the child's future needs for instruction in Braille or the use of Braille),
that instruction in Braille or the use of Braille is not appropriate for the child;
Considerations for an AA-MAS
Page 55
4. Consider the communication needs of the child, and in the case of a child who is deaf or
hard of hearing, consider the child's language and communication needs, opportunities
for direct communications with peers and professional personnel in the child's language
and communication mode, academic level, and full range of needs, including
opportunities for direct instruction in the child's language and communication mode; and
5. Consider whether the child needs assistive technology devices and services. [34 CFR
§300.324(a)(2)]
There is language throughout IDEA 2004 [see §300.321(a)(1-7), §300.321(b)(1-3)] that
clearly emphasizes the academic curriculum and the link between assessment and instruction.
The importance of academics is even recognized in guidance that general educators should be
members of the IEP team. Relative to academic instruction, the IEP is now to reflect how the
student will access the general curriculum and what the academic priorities are. New guidance
on AA-MAS also calls for monitoring progress toward academic goals.
What Does it Mean for an IEP to be “Standards-Based”?
In interviews with representatives from 18 states, Ahearn (2006) found some confusion
over the term ―standards-based‖ as it applied to IEPs. Part of that confusion came from the
different types of standards (i.e., content and achievement). IDEA 2004 requires ―(i) A statement
of measurable annual goals, including academic and functional goals designed to—(A) Meet the
child's needs that result from the child's disability to enable the child to be involved in and make
progress in the general education curriculum; …‖ [34 CFR §300.320(a)(2)(i)]. IEPs that meet the
IDEA requirements include a broad range of information to explain how the student will access
the general curriculum.
The two key elements specific to students who take AA-MAS are that there must be
annual goals based on grade-level academic content standards and that there must be
Considerations for an AA-MAS
Page 56
mechanisms in place to measure progress toward achieving those goals. The specific language
from the final AA-MAS regulations is as follows:
a. ―The student‘s IEP must include goals that are based on the academic content
standards for the grade in which the student is enrolled‖ …That is, while students
may have their performance evaluated against modified achievement standards, they
must be taught academic content based on grade-level content standards.
b. …―and be designed to monitor the student’s progress in achieving the standardsbased goals‖. [U.S. Department of Education, 2007a, 200.1(e)(2)(iii), (f)(2)(i), emphasis
added]
Although access to a curriculum based on state content standards has been guaranteed
since 1997 and was reinforced in IDEA 2004, the requirement that state content standards be
reflected in IEP goals is new with AA-MAS (Thurlow, 2008). States have taken various
approaches to interpreting the requirement for ―standards-based‖ IEPs (Ahearn, 2006).
Students‘ present levels of academic performance are evaluated based on their mastery of state
content standards and areas in which they have not yet mastered those standards. Some states
then require the IEP team to write goals that emphasize the skills the student will need in order
to make progress in those content standards that have not yet been mastered. Other states
require IEP teams to consider academic content standards broadly when evaluating present
levels of performance and setting goals, but do not require teams to base those decisions on
specific grade-level standards (Ahearn).
With the standards-based approach to IEPs, it is clear these documents now play a
much different role in educational planning in 2009 than they did in 1975. As states think about
how to address the regulatory requirements for AA-MAS, it is important to keep in mind that the
IEP cannot be a map of the entire academic curriculum for a student. It cannot contain
documentation of all instructional strategies used with a student in a given year. It can, however,
drive a purposeful planning process and ensure that good decisions about educational goals,
Considerations for an AA-MAS
Page 57
grounded in a clear understanding of student needs, are established. Thus, before addressing
IEPs directly, this chapter first considers implications for the planning, delivery, evaluation, and
adjustment of instruction for this population of students.
Effective Curriculum and Instruction
One common criterion for determining eligibility for the AA-MAS is that the student will
not attain proficiency on grade level assessment in the current year, despite having received
appropriate instruction all year. In reality, some students who are eligible for AA-MAS have
probably not been proficient on grade-level assessments for multiple years—they are what
some states call ―persistently low-performing.‖ Thus, the target population may vary in their
patterns of past performance, depending on how states define eligibility for the AA-MAS.
Eligibility decisions based on past performance will be tied to states‘ guiding philosophies and
theories of action about the AA-MAS (see Quenemoen, Chapter 2, and Marion, Chapter 9, this
volume). Regardless of past performance patterns, the assumption is that these students are
unlikely to achieve proficiency on the general assessment despite having had instruction in
grade-level academic content. Based on surveys and focus groups from teachers in several
states, Perie (2009) synthesized findings from several focus groups on students who may be
eligible for AA-MAS and provided a potential list of their characteristics:
Require intensive, specially designed instruction and individualized supports
Instruction repeated many times in many ways to make progress
Passive learners, non-risk takers
Learning and meta-cognitive deficits (e.g., poor generalization, difficulty transitioning
between topics, applying learned strategies) [slides 19–21]
While this list is not exhaustive and may not represent the characteristics of AA-MAS-
eligible students in all states, these types of characteristics reinforce the notion that highly
effective instruction is the lynchpin for this population. Without meaningful access to the general
Considerations for an AA-MAS
Page 58
curriculum, low-performing students will have little hope of working toward grade level
expectations. What is essential in designing an instructional program that will help students
meet these high expectations for growth? The Access Center (2006) offered the following as
characteristics of educational programs that provide meaningful access to the general
curriculum:
The general education curriculum includes appropriate, standards-based instructional
and learning goals for individual students with disabilities, as well as reflects an
appropriate scope and sequence.
Materials and media being used are appropriate, research-based, and documented as
being effective in helping students with disabilities learn general education content and
skills.
Appropriate, research-based instructional methods and practices that have a track
record for helping students with disabilities learn general education content and skills are
being used.
Research-based supports and accommodations that have a track record of helping
students with disabilities learn general education content and skills are being used.
Appropriate tools and procedures for assessing and documenting whether students with
disabilities are meeting high standards and achieving their instructional goals are being
used (pp. 4, 6).
This section of the chapter describes several characteristics of highly effective
curriculum and instruction for students who present with the challenges described above.
Suggested resources are provided in Appendix B of this report.
1. Students are given access to grade-level content
As noted earlier, ensuring access to grade level content standards is a fundamental
expectation in the Federal Regulations on AA-MAS [USED, 2007b, §200.1(f)(2)(iii)]. State
Considerations for an AA-MAS
Page 59
content standards are often organized around content knowledge and processes reflected in
standards or frameworks from national groups (e.g., National Council of Teachers of
Mathematics, 2000; National Reading Panel, 2000). States differ in the grain size (i.e., level of
specificity) and sequences of content area skills and knowledge when articulating their content
standards. (See Pugalee and Rickelman, Chapter 5, this volume, for additional discussion on
state content standards.)
By law, students eligible for AA-MAS are to be taught a curriculum that is based on
chronologically-appropriate, grade-level standards. Unlike AA-AAS, where students receive
instruction in content that ―links‖ to grade-level content standards, students who take AA-MAS
are expected to be working ―in‖ grade-level content standards. That does not mean they are
taught the state standards directly; instead, they are provided with a curriculum that is aligned to
state standards. Pugalee and Rickelman (Chapter 5, this volume) define curriculum as, ―a set of
planned instructional activities that are designed to allow students to document achievement of
their knowledge and skills‖ (p. 155).
One of the challenges in targeting academic goals for this target population of students
is that they may lack foundational skills from earlier grade levels, upon which the current gradelevel content standards are based. Teachers will walk a fine line between helping the student
master skills from earlier grades while also being taught a curriculum based on current gradelevel standards. Ahearn (2006) offered the following for how students might work toward IEP
goals based on content standards from an earlier grade:
Such students are expected to make more than one year of progress through
standards-based instruction because the needed skills are targeted by the teacher.
Teachers scaffold instruction (i.e., provide supports as necessary) and prerequisite
skills are used to work toward the grade-level standards. For example, a student who
cannot read 6th grade materials may work toward a grade-level standard that calls for
analyzing written materials. The cognitive processes associated with that higher level
Considerations for an AA-MAS
Page 60
reading skill can still be taught while the student accesses the grade-level materials
in a different way. (p. 8).
The caution here is that, according to the AA-MAS regulations, IEP goals must be based
on the current (chronologically-appropriate) grade-level content standards. In order to plan an
effective curriculum based on remediation of missing skills and progress in current skills,
teachers will have to be knowledgeable about how topics within a strand (e.g., algebra, writing)
relate to each other within a grade level and how they build across grades. Teachers also will
need to make choices about how best to sample content from within a standard. Clear
alignment between assessment and instruction is essential if student assessment scores are to
allow for valid inferences about proficiency against a set of standards (American Educational
Research Association [AERA], American Psychological Association [APA], & National Council
on Measurement in Education [NCME], 1999). General education teachers with expertise in
content areas may play a critical role in helping special educators select content and design
instruction for this population. These content experts will be fluent in their state‘s grade-level
content standards and national standards for the discipline; know how instruction is designed to
promote learning in that skill; and understand how the components relate to and build upon
each other (see Pugalee & Rickelman, Chapter 5, this volume).
Curriculum materials can make it easier (or more difficult) for teachers to deliver gradelevel curriculum. Adaptations of grade-level materials may be increasingly necessary the farther
away from grade-level proficiency a student is working, and the more unique or complex the
student‘s learning needs. The accessibility issue related to curriculum materials becomes
increasingly important as students enter upper grade levels. In middle schools, textbooks are
more like reference books and contain vocabulary that is above grade level; factual knowledge
is emphasized over procedural knowledge (Hill & Erwin, 1984; Jitendra, Nolet, Xin, Gomez,
Renouf, Iskold, et al., 2001). There may be a tendency for special educators working with AA-
Considerations for an AA-MAS
Page 61
MAS-eligible students to significantly adapt materials, potentially jeopardizing access to gradelevel content if the adaptations stretch too far.
2. Instruction consists of proven practices that allow teachers to set a trajectory toward
performance that can be evaluated against grade-level achievement standards
Meaningful progress toward grade-level achievement requires accurate assessment of
the student‘s current level of performance — using multiple, high-quality assessments that are
appropriate for the student and provide valid inferences about strengths and areas for growth
(34 CFR § 200.1(e)(2)(ii)(B). When states determine eligibility for AA-MAS based on a pattern of
low performance across years (vs. a single year), baseline performance may also be
investigated retrospectively to help teachers determine how firmly established a pattern of
limited or no progress has been. Armed with this information, teachers would then also draw on
their knowledge about how students learn in order to plan instruction. (See Pellegrino, Chapter
4, this volume, for an extensive description of student cognition).
Once present levels of performance are known, teachers will then adopt research-based
strategies — those that work with the type of student whose IEP they are designing. The What
Works Clearinghouse (http://ies.ed.gov/ncee/wwc/) is one source of information about which
interventions have evidence of effectiveness. However, one pitfall is that the strict evidence
standards set by the WWC means that when the evidence base is sparse or based on less
rigorous designs (e.g., those without a comparison group), emerging evidence is not
synthesized. For example, in a March 2009 search of the middle school math interventions
listed by WWC, only 12 of 50 interventions had reports available. (Two more reports were
pending.) The rest were not synthesized because of a lack of studies that met the criteria. Even
when the WWC offers helpful syntheses, teachers will need to consider whether the
interventions ―work‖ for students who are eligible for AA-MAS based on their state‘s criteria.
When WWC evidence is lacking, the field may look to other sources of evidence for state-of-the-
Considerations for an AA-MAS
Page 62
art practice (see the Spring 2009 issue of Exceptional Children, for example, or literature
syntheses and meta-analyses published in journals such as Review of Educational Research).
Selecting research-based strategies and applying them indiscriminately will not
automatically boost student learning. Teachers also use their knowledge of content and how
skills link and build upon one another toward a goal for performance that year. Concepts such
as learning progressions and learning maps will be discussed in detail later in this volume
(Chapters 4, 5, 7, and 9), but it is worth mentioning them briefly here. Where research exists to
support cognitive models of learning, teachers may use those models to plan for a sequence of
instruction. Where such evidence does not yet exist, familiarity with typical content sequences
may guide planning. Since all students do not learn in the same sequence, awareness of when
skills really are (or are not) prerequisite or foundational concepts will also be important.
Regardless of whether the teacher thinks in terms of progressions, maps, or sequences,
identifying baseline knowledge and skills at a fine-grained level (e.g., a student‘s specific
content knowledge within strands and process skills within a subject area) rather than coarsegrained level (e.g., the student scored below basic in English language arts the previous year),
the foundation will be set for a meaningful path for teaching and learning.
If teachers have accurately identified present levels of performance and effective
instructional strategies for that skill and that student, and have a solid understanding of how to
build skills in that area, what should be the target for that student? One option would be to
consider the state‘s modified achievement level descriptors and evaluate how much progress
the student would need to make in order to move from his or her current AA-MAS achievement
level to the next highest one (see Perie, Chapter 7, this volume, for a discussion of modified
achievement standards). Over a period of years, long-term planning for the student could
include annual goals for certain skills with a multi-year goal of transitioning back to eligibility for
assessment options based on grade-level achievement standards. IEP teams will make
decisions about how to set interim achievement targets at lower levels while also maintaining
Considerations for an AA-MAS
Page 63
the highest possible performance expectations. General educators with deep content expertise
may be helpful in setting reasonable interim targets. The concepts of learning progressions or
learning maps may also guide decisions about interim targets. IEP teams may wish to guard
against setting lower targets for several consecutive years, as there may be a point at which the
student is unintentionally tracked into a level of performance that precludes his or her later
participation in assessments based on grade-level achievement standards.
Assuming a growth trajectory has been set and the student makes progress during the
academic year, there is a risk that the student will lose momentum—and perhaps even lose
ground that was gained—during long breaks (i.e., summer vacation on a 9-month calendar or
intersession on a year-round calendar). IEP teams have the option to include Extended School
Year (ESY) services as part of the student‘s plan in order to prevent loss of the skills and
knowledge that the student built during the academic year.
3. Instruction is flexible and responsive to student progress
Even with a well-designed, long-term plan for instruction, teachers cannot just assume
the student will progress according to that plan. They will need to monitor progress closely and
know when to make decisions to adjust instruction. A vague plan to monitor progress using
―teacher observation‖ will not be sufficient. Teachers need to know how to design or identify,
and use, effective formative assessment methods.
Established techniques such as curriculum-based measurement (CBM) and progress
monitoring can help teachers track student progress. CBM offers a way to periodically assess
student progress toward long-term goals using assessments that are technically sound but easy
to use in everyday instruction (Deno, 2003; Stecker, Fuchs, & Fuchs, 2005). Stecker (n.d.)
provides a concise overview of how to use CBM for progress monitoring, including how to use
CBM data to set IEP goals and short-term objectives. One option for monitoring progress is for
teachers to chart performance on a graph with a superimposed trend line (see an example in
Considerations for an AA-MAS
Page 64
Figure 3-1). Establishing the trend line between present level of performance and target level of
performance helps the teacher monitor interim progress. There are numerous resources
available on the internet for teachers to customize and create graphs for tracking student
progress using Excel or Web-based forms (see Appendix B). A synthesis report published by
the National Center on Educational Outcomes (NCEO; Quenemoen, Thurlow, Moen,
Thompson, & Morse, 2004) places short-term CBM in the broader context of progress
monitoring linked to summative assessments and improvement across years.
Figure 3-1. Example of Graph for Tracking Student Progress against a Trajectory
Joseph
Grade 2
Washington Elementary
CBM Monitoring Graph
Correctly Read Words Per Min
12
10
8
6
6
J
o
5
4
4
4
3/20
3/21
5
5
5
3/22
3/23
3/24
3
2
2
2
1
2
2
3/15
3/16
2
1
0
3/10
3/11
3/12
3/13
Assessment Dates
3/14
3/17
3/18
Monitoring: CRW
3/19
Baseline: CRW
3/25
CBM Trendline
As the literature base on CBM and progress monitoring grows, new studies emerge that
may help teachers refine their practice to maximize benefit while minimizing effort. For example,
Jenkins, Graff, & Miglioretti (2009) determined that reading growth (defined as words read
correctly) could be determined with less frequent measurement but that it was important to
Considerations for an AA-MAS
Page 65
obtain sufficient baseline data and an appropriate number of measurements at each data
collection period.
Teachers may effectively set a goal, collect data to measure progress using technically
sound measures, and recognize when students are not progressing as they had hoped. The
other critical skill is how to determine why that progress has not been made. If present levels of
performance and student needs were correctly identified in the planning stages, the teacher
may be reasonably satisfied that there were no unidentified deficits in prerequisite concepts.
However, the teacher may find weaknesses in student knowledge of certain components of the
target skill, or may be able to uncover the source of the problem through error analysis and
identification of the student‘s faulty thinking. Although the adjustment of instruction is dealt with
extensively in the progress monitoring and Response to Intervention (RtI) literature, Nitko (2004)
offers several general recommendations for how to conduct diagnostic assessments of
students‘ problems in learning targeted content. The ability to monitor and adjust instruction
requires creativity and persistence by teachers who design and deliver the instructional
program.
4. The instructional program minimizes barriers and provides the full range of supports
necessary to promote growth
To provide effective instruction, teachers need to determine what aspect of student
performance represents the student‘s actual mastery level and skill deficits, and what aspects
may be related to the disability or other barriers to learning. IEP teams are responsible for
determining what types of accommodations are appropriate in instruction and assessment,
based on how the student‘s disability impedes his or her ability to learn in the general
curriculum. (See the section titled ―State Guidance to IEP Teams‖ later in this chapter for a full
discussion of accommodations.) If the student has characteristics considered under IDEA to be
special factors that interfere with learning (behavior, limited English proficiency, visual
impairment, hearing impairment, or the need for assistive technologies), those areas must also
Considerations for an AA-MAS
Page 66
be addressed. As the options for assistive technologies expand, teachers will need good
resources for identifying appropriate devices that minimize barriers stemming from specific
student needs (see Appendix B). Teachers may also want to consider how environmental
variables, such as instructional setting and grouping, may influence students‘ access to the
general curriculum (Soukup, Wehmeyer, Bashinski, & Bovaird, 2007).
Given the diversity of students who may be eligible for AA-MAS, not all students in a
class will need the same learning supports. It is possible that AA-MAS-eligible students can
receive learning supports as a matter of standard practice for all students. Differentiated
instruction allows teachers to identify how each AA-MAS-eligible student will access grade-level
content; respond to student progress and adjust instruction accordingly; and incorporate
learning supports needed due to student disabilities. Differentiation is more than just grouping—
it involves variation in the curriculum (i.e., different materials) and how it is taught (i.e., individual
work, small group, whole group; Gibson & Hasbrouck, 2008). However, differentiation for the
target population cannot involve teaching to lower grade-level standards.
There are two more promising options for minimizing barriers and providing supports
that are worth mentioning here. Both are applicable to broader ranges of students and may be
useful beyond the target population for AA-MAS. Universal design for learning (UDL) offers a
framework for reducing barriers by allowing for multiple means of representation, expression,
and engagement (CAST, 2008). UDL is offered as a way to build flexibility and differentiation
into instruction from the outset, thereby avoiding the need to adapt curriculum after the fact.
Response to Intervention (RtI) focuses on prevention of long-term failure through early
assessment and intervention, with frequent progress monitoring and adjustment (National
Center on Response to Intervention, n.d.). Although described in IDEA 2004 as a means of
identifying and serving students with specific learning disabilities, RtI can be applied to an entire
student body for multi-tiered interventions rather than a single sub-population with identified
Considerations for an AA-MAS
Page 67
disabilities. RtI is discussed in more detail in Chapter 2 of this volume (Quenemoen) and in
resources listed in Appendix B.
This section provides only an overview of principles that may guide effective instruction.
States and LEAs will determine how best instructional practices translate locally, given state
policies and local norms. However, the practices described here offer alternatives to help
students who need to demonstrate substantial progress if they hope to reach grade-level
proficiency. In many ways, the principles described here are suggestions for operationalizing the
IDEA requirement for specifically designed instruction [20 U.S.C. § 1401(29)].
Using the IEP to Promote Quality Instruction
The federal guidance that requires the IEP to contain goals based on grade-level content
standards implies that what gets documented is what gets taught. However, the IEP cannot
reflect a student‘s entire academic curriculum in detail. How does the IEP meet the letter of the
law while also providing a blueprint for the student‘s broader educational needs? This section
offers suggestions for how to use the IEP to guide high-quality academic instruction, by
identifying needs, writing goals, and monitoring progress. It concludes with a discussion on how
IEP teams can discuss and select optimal supports based on individual needs.
Reviewing Present Levels of Performance and Identifying Need
When planning an IEP, teams carefully consider present levels of performance and
prioritize needs in order to support growth toward grade-level achievement. Present levels of
performance may be described using recent assessment data (collected using sound
measures) that are specific about student performance and areas for growth. For example, an
IEP team will be better able to target goals in areas of need with detailed description (e.g., ―Joey
can read fifth grade texts with 70% accuracy and 85 WPM but has difficulty inferring meaning of
unfamiliar words using context clues.‖) than generalities (e.g., ―according to the WoodcockJohnson, Joey is reading at grade level 4.2‖). CBM may be useful in determining present levels
Considerations for an AA-MAS
Page 68
of performance, as a series of baseline measures can be used to describe student performance
in specific ways that promote good goal-setting (Stecker, n.d.).
Selecting and Writing Academic Goals
The final regulations for AA-MAS indicate that the IEP must contain goals ―based on the
academic content standards for the grade in which the student is enrolled‖ [USED, 2007a, p. 9;
200.1(e)(2)(iii)]. Since the AA-MAS must cover the same range of content as the typical gradelevel assessment (USED, 2007b, p. 21), IEP teams will need to consider the full range of gradelevel content standards for each subject to determine what strands or objectives represent
priorities for instruction. It is important to remember that if a student takes an AA-MAS in one
subject but is eligible for assessment against grade-level achievement standards in other
subjects, the IEP is not required to include those areas assessed against grade-level standards.
In other words, IEP goals are only required for areas in which the student‘s disability is
interfering with participation in the grade-level achievement test. However, the IEP team may
choose to write goals for subjects in which the student is assessed against grade-level
achievement standards.
In subjects the IEP team has determined will be assessed using the AA-MAS, the target
for instruction is the student‘s chronologically age-appropriate, grade-level content standards.
Even if a state has designed its content standards as vertical progressions that build across
grade levels, it is not acceptable to base IEP goals on academic standards at a lower grade
level (NCEO, 2007).
Annual IEP goals represent priorities in the student‘s curriculum that year. That does not
mean that the content of a goal statement must match the wording of a state content standard.
IEP teams may identify essential skills that support general curriculum access in that area.
Indeed, teams may identify pivotal skills—those that cross academic subjects and promote
growth across the curriculum (Browder, Spooner, Wakeman, Trela, & Baker, 2006). For
Considerations for an AA-MAS
Page 69
example, the team may write a goal for the student to learn certain strategies for monitoring
comprehension of a reading passage. This self-evaluation skill may not be part of a reading
content standard in that grade level, but a student who can monitor comprehension may make
better progress in reading as well as other content areas. As Pellegrino (Chapter 4, this volume)
notes, metacognitive skills are most effective if taught in specific content areas rather than
outside the context of a content area. A team may determine that a specific content goal is the
best choice to support the student‘s growth toward grade-level proficiency, but the team should
be discouraged from treating the content standards as an à la carte menu.
Once the priorities for IEP goal contents have been determined, the team then follows
best practices for writing effective IEP goals. Good IEP goals are measurable, which means
they cannot be vaguely written and need to reflect an observable behavior. They also specify
the conditions under which the behavior will be demonstrated and the criterion for performance.
There are numerous guides for writing effective IEP goals. Two excellent examples are
Bateman & Herr (2006) and Courtade-Little & Browder (2005).
Monitoring Goal Attainment
With IDEA 2004, IEP teams are no longer required to write short-term objectives (STOs)
for each goal [except for students who will be assessed using AA-AAS; see USED, 2007b, p.
29, IDEA 614(d)(1)(A)(I)(cc)]. Without this requirement, IEP teams will need to be diligent in
thinking through and planning for progress monitoring. Progress must be reported at least as
frequently as it is for students without disabilities (which typically translates to a report card
cycle). It is hard to imagine that AA-MAS-eligible students can successfully accelerate their
learning if teachers are doing the bare minimum to meet the letter of the law for monitoring
progress. Bateman and Herr (2006) offer a convincing argument that teams should write STOs
to operationalize the broader goal, even though that step is no longer federally required. Teams
Considerations for an AA-MAS
Page 70
may also develop and document other methods in accordance with their state‘s interpretation of
the federal requirement to monitor progress.
Regardless of whether teams write STOs for these students, it is important that the
criterion reflected in the goal be based on projected growth from present level of performance,
in a given period of time, with a target that sets the student on a path toward grade-level
achievement. Ideally, methods for monitoring goal attainment would follow some of the CBM or
progress monitoring approaches described earlier in this chapter. Again, these methods would
be based on technically sound, appropriate assessment instruments, administered at a
frequency that allows teachers to monitor and report progress and adjust instruction without
becoming a burdensome process that detracts from instruction.
Choosing and Designing Supports
As mentioned previously, IEP teams will design educational programs for AA-MASeligible students that allow for achievement based on modified rather than grade-level
achievement standards, while also building toward eventual participation in assessments based
on grade-level achievement standards. IEP teams will decide how to design appropriate
supports for learning, without setting the standard so low that the student is not held to the
highest expectations possible that year. The IEP can provide clear markers about what supports
are needed, in which parts of the curriculum, and under what conditions. If the team instead
provides blanket coverage of certain supports across all subject areas and settings (including
accommodations, positive behavior support plans, and ESY services), the team may end up
overcompensating for the student‘s disability and providing supports that are not needed or
even serve to limit student growth.
One decision support tool IEP teams can use to plan for academic instruction is provided
in Table 3-1 (Quenemoen, 2009). Teams can ask themselves a series of questions to consider
individual student strengths and needs within the context of his or her educational system.
Considerations for an AA-MAS
Page 71
These questions would be asked and answered differently for each student, since each student
may have different priorities for the academic year.
Table 3-1. Question Guide to Support IEP Team Decision-Making Process
SYSTEM OPPORTUNITIES AND
CHALLENGES—representatives
from school and district must
provide this information and
support
1. What is the required content in the
next grade level?
2. Where and when during the school
year is the required content taught in
Math? ELA? Science? Social Studies?
Other? How is it taught?
3. What instructional and curricular
options are currently available in this
school to allow all students to achieve
proficiency in the goals and standards
set for all students?
4. What array of services does the
school provide to meet the students‘
other needs?
5. What is the curricular map into the
future, what are essential
understandings every student needs to
achieve this year?
6. How does all this go together, with
professional development, support,
continuous improvement, community
linkages . . . so that ALL children are
successful?
STUDENT STRENGTHS AND NEEDS—IEP team grapples
with data-based decision-making to identify services,
supports, and specialized instruction so that the student will
be successful
1. What are the student‘s current strengths and needs in the
academic content areas? What data do we have to make that
determination? What accelerated or remedial services and
supports are necessary to ensure success in the content for the
next grade level?
2. What adaptations and accommodations can the student use
to access the grade-level content regardless of specific deficit
basic skills in reading or mathematics or English language? What
data do we have to support these choices? How will we
determine if their use is effective or needs changing?
3. What specific instructional strategies work well for this
student? What types of curricular materials work well for this
student? If the needs of this student and the options available
don‘t align well, what aids, services, supports, and instruction
does the student need to be successful in spite of gaps? How
can the current options be changed?
4. What specific nonacademic needs does this student have?
What goals and objectives will address those needs? How do
these relate to the student‘s academic success?
5. How can we set priorities to ensure the essential
understandings are mastered by this student, but still allow the
TIME the school and the student needs to address all needs.
6. How do we align curriculum, instruction, supports, services,
and needs so that THIS child is successful at grade level?
Reprinted with permission from Quenemoen (2009).
It will be important for IEP teams to carefully consider the implication of choosing a
particular type of adaptation for the student‘s educational program and progress. If the content
is significantly modified, teachers run the risk of unintentionally lowering expectations for the
student. If the student is eligible for the AA-MAS because his or her disability precluded success
in assessments based on grade-level achievement standards [as required in 34 CFR §
200.1(e)(2)(i)], designing instruction that changes the construct may lead teachers to continue a
pattern of working around the disability (i.e., modification) rather than eliminating its influence
Considerations for an AA-MAS
Page 72
through effective supports (i.e., accommodation). Again, the educators on IEP teams will need
to be very skilled in choosing adaptations and evaluating their impact on the student.
State Guidance to IEP Teams
Just as teachers design supports for student learning, states offer supports to IEP teams
so educators can effectively meet the lofty goals described earlier. As a matter of routine, states
already provide guidance to IEP teams on how to manage the ―paperwork‖ aspect of IEPs (i.e.,
adhering to procedural aspects of IDEA, appropriate documentation of required IEP
components). However, now that the IEP is a living document rather than a bureaucratic
formality (Sopko, 2003), states have opportunities to provide a wide array of guidance to help
teams design effective instructional programs as well. Recognizing that districts and individual
schools act as a filter to interpret state guidance, this section highlights supports that states may
wish to provide IEP teams planning for this target population of students.
Decisions about Participation in AA-MAS
Federal regulations to guide eligibility for AA-MAS are detailed in the final AA-MAS
regulations and elsewhere in this volume (see Perie, Chapter 1, this volume). At a minimum,
states are responsible for clearly conveying to the IEP team:
that a student from any disability category may be eligible for AA-MAS [34 CFR
§200.1(f)(1)(ii)]
various options for students with disabilities to participate in statewide assessment
(including assessments based on grade level, modified, or alternate achievement
standards), including the impact of assessment choice on the student‘s educational
options according to State or local policy [34 CFR §200.1(f)(1)(iii)]
clear and appropriate guidelines to determine when AA-MAS is the appropriate
assessment option [34 CFR §200.1(f)(1)(i)]
Considerations for an AA-MAS
Page 73
that students may be assessed against modified achievement standards in one or more
subjects for which assessments are administered under §200.2 [34 CFR §200.1(f)(2)(i)]
Teams must also use a pattern of data to determine the appropriate assessment
participation option (USED, 2007b, p. 18). States may also wish to give IEP teams guidance on
when it would be appropriate for a student to resume participation in state assessments based
on grade-level achievement standards.
Regardless of states‘ eligibility criteria, IEP teams will need clear guidance that allows
them to differentiate in rather nuanced situations. Specific criteria and supporting rationales will
help IEP teams understand how the criteria apply to each student. States may find that a
flowchart, checklist, or other decision support tool may help teams correctly apply the criteria to
arrive at appropriate eligibility determinations. Lazarus, Rogers, Cormier, & Thurlow (2008)
reviewed states‘ AA-MAS participation guidelines and found 15 categories of criteria across nine
states that had an operational AA-MAS. Most frequently occurring were criteria based on the
regulatory language (e.g., must have an IEP; decision not made based on a specific disability
label not progressing at a rate at which the student would be expected to reach grade-level
proficiency within the year; the use of multiple measures to determine previous performance;
student cannot demonstrate knowledge on grade-level assessment even with appropriate
accommodations). Some states also provided exclusion criteria, such as not being based on
placement setting, not being eligible for AA-AAS, and not having a pattern of past performance
attributable to excessive absences or other non-instructional factors (e.g., cultural, language,
economic).
Lazarus et al. (2008) also compiled the materials they reviewed from the nine states into
their synthesis report. As states think about how to provide clear guidance to IEP teams about
appropriate assessment participation decisions, they may wish to adapt some of these
published examples for their own use. For example, Maryland provides an IEP Team DecisionMaking Process Eligibility Tool that teams can use to compile past assessment data, past
Considerations for an AA-MAS
Page 74
instruction, and clear evidence of specific content-area interventions and relevant IEP goals in
the past years. That type of tool could be helpful to teams in determining that the student really
had received effective instruction in prior years. As another example, North Dakota provides an
8-item checklist with an accompanying flowchart to guide teams to the most appropriate largescale assessment option for the student. As more states move closer to deciding whether and
how to provide AA-MAS options, additional examples will likely be forthcoming.
Decisions about Accommodations
IEP teams make choices about what accommodations are appropriate under which
circumstances. Accommodations remove the influence of a disability on a student‘s ability to
show what she or he knows and can do. Accommodations should not be confused with
modifications, which are adaptations that change the construct being taught or assessed. IEP
teams determine appropriate accommodations based on how the disability influences the
individual student‘s participation in the educational program. Accommodation needs may differ
by subject area or type of learning activity, and are determined separately for instruction,
classroom assessment, and statewide assessment. Accommodations typically fall into one of
four categories: presentation format, response format, timing, and setting (AERA, APA, &
NCME, 1999). IEP teams are to choose the right kinds of accommodations for the right student
in order to reduce construct-irrelevant variance and allow for valid inferences about student
performance on assessments. The chosen accommodations should also provide sufficient
support for the student to fully participate in instruction. Accommodations are discussed in more
detail in Abedi (Chapter 9, this volume).
Unfortunately, there is a history of IEP teams having difficulty correctly interpreting
accommodations (Byrnes, 2008) and choosing and applying accommodations (Shriner &
Destefano, 2003) using best-practice principles. There may be confusion on the team about
issues such as when to use accommodations, and the difference between an accommodation
Considerations for an AA-MAS
Page 75
and a modification. When choosing accommodations for AA-MAS, IEP teams are required to
avoid accommodations that would, if used on the grade-level assessment, invalidate the score
(USED, 2007b, p. 27).
IEP teams will also consider the match between accommodations given for instructional
purposes and in large-scale assessments. There is extensive literature supporting the
correspondence between these two, but IDEA gives IEP teams latitude to allow instructional
accommodations that would be considered necessary ―for a student to advance toward attaining
his or her annual goals, to be involved in and make progress in the general curriculum, and to
be educated alongside his or her nondisabled peers‖ [USED, 2007b, pp. 32-33; see also 34
CFR 300.320(a)(4)(i-iii)]. In other words, there is no legislation that precludes the use of
accommodations in instruction that would invalidate results if used on a statewide assessment.
Conversely, states may prohibit the use of accommodations on an assessment if they have not
also been given in instruction. Given this potential mismatch, IEP teams will need to be
deliberate in their choice of instructional accommodations beyond those that will be allowable
for the assessment the student will take.
States can support the appropriate choice of accommodations in several ways. They can
provide full descriptions of the types of accommodations, with examples, organized by category
and with indications about where their use is appropriate (see New York‘s guidelines, for
example: http://www.vesid.nysed.gov/specialed/publications/policy/testaccess/policyguide.htm).
States may also develop decision support materials or even computer-based tools, such as
those described by Kopriva, Koran & Hedgspeth (2007).
Regardless of which accommodations a state approves, the decisions should be made
using the best data available. NCEO offers an online accommodation bibliography
(http://www2.cehd.umn.edu/NCEO/accommodations/), which is a searchable database with
annotated bibliographies that allow for quick scanning of outcomes across studies. As states
deliberate which accommodations to allow under which conditions, they will need a process to
Considerations for an AA-MAS
Page 76
review those potential accommodations. One such process, using a panel review, is described
in Almond & Karvonen (2007). Finally, Thurlow, Christensen, & Lail (2008) summarized a review
of reviewer comments made about accommodations practices described in states‘ peer review
materials. Lessons learned from past peer reviews may be informative to states that implement
new accommodations policies or options for AA-MAS-eligible students.
Other Guidance
State and local educational agencies coordinate and sponsor professional development
for teachers on a wide array of topics. They may not be able to cover every topic described in
this chapter, but an effective needs assessment could reveal areas to prioritize in planning
professional development on topics related to AA-MAS-eligible students. Special education
bureaus may be able to sponsor professional development on topics such as determining
academic priorities, designing learning progressions, differentiating and adjusting instruction,
CBM, and progress monitoring. As new technologies or teaching strategies become available,
professional development plans may be adjusted to incorporate promising practices. Jobembedded professional development (e.g., professional learning communities, instructional
coaches) may be the best approach to support teachers as they develop these instructional
skills (Joyce & Showers, 2002).
Also, since the instructional implications associated with AA-MAS are relatively new,
states should not overlook the importance of addressing teacher beliefs about what this
population of students can do and how to hold them to high expectations (see Quenemoen,
Chapter 2, this volume). States may wish to provide guidance to IEP teams on how to set
interim achievement targets that are lower without precluding students from being eligible for
assessments based on grade-level achievement standards by the time they would need to take
assessments linked to high school graduation requirements. The Access Center (2004)
provides several other suggestions for how states can offer training on IEP writing.
Considerations for an AA-MAS
Page 77
States may also be able to warehouse on their Web sites a wide variety of sample
materials, including well-structured IEP goals and objectives, links to information on researchbased curricula, assistive technologies, and data collection materials for formative assessment.
One example is the site developed by the Georgia Department of Education for teachers of
students who take the state‘s AA-AAS (http://gadoe.georgiastandards.org/impairment.aspx).
These Web sites may also offer a way for teachers to exchange their own created materials
through a community working toward the same goal—figuring out how to do what nobody has
yet done for these students. State-level assessment departments could coordinate resources
related to large-scale assessment. Assessment and Special Education divisions could
collaborate to provide guidance on how to link formative and summative assessments.
Finally, states and LEAs may provide guidance and professional development to
strengthen collaboration among IEP team members. Successful instruction for AA-MAS-eligible
students will require effective collaboration between general educators and special educators. In
schools where the responsibility for educating this population still lays solely with the special
education staff, administrators may want to guide a shift toward the philosophy that the whole
school is responsible. The principal, as instructional leader for all students, may need to send a
strong message in order to prompt this shift.
One other aspect of collaboration should not be neglected. Parents are partners on the
IEP team and must be equipped to be meaningful contributors to the planning that will take
place. They will need to understand what the AA-MAS is, and what participation means for their
children [§200.1(f)(1)(iv)]. If the instructional program under AA-MAS represents a fork in the
road leading away from past practices, parents will also need to understand what to expect
differently in terms of progress monitoring, reporting progress to families, and mid-year
instructional changes. One excellent resource for this information is a parent guide published by
the National Center on Educational Outcomes (Cortiella, 2007).
Considerations for an AA-MAS
Page 78
Systems for Monitoring IEPs
According to IDEA 2004 regulations, the state‘s primary responsibility for monitoring
IEPs includes ―(1) Improving educational results and functional outcomes for all children with
disabilities; and (2) Ensuring that public agencies meet the program requirements under Part B
of the Act, with a particular emphasis on those requirements that are most closely related to
improving educational results for children with disabilities‖ [300.600(b)(1-2)]. While IEP
monitoring tends to focus on procedural compliance, states may need to consider how they will
monitor the IEP‘s role in ensuring that AA-MAS-eligible students receive highly effective
instruction in grade-level content.
Review for substance rather than procedural compliance will be a significant challenge
and states may find such reviews to be resource-intensive. One option is to implement an
educational benefit review, such as the one designed by New York State Department of
Education (2008). This review, conducted with a small sample of IEPs from local education
agencies selected annually on a rotating cycle, focuses on whether the IEP was ―reasonably
calculated for the student to receive educational benefit.‖ (p. 1). The process starts when the
review team extracts information from the IEP about the student‘s educational program,
including services, supports, needs, annual goals, accommodations, and modifications. The
team then analyzes relationships among components of the IEP to determine whether needs,
goals, and services were aligned to promote progress within a 3-year cycle. The comparison
includes relationships across three consecutive, annual IEPs to determine if there is evidence of
educational benefit such as progress toward goal attainment and increased complexity in goals.
Finally, the review team evaluates the results for evidence of best practices and areas for
improvement, and makes a final determination about whether the program was designed to
result in educational benefit for the student.
Greater standardization of IEP content and format will lighten the burden of conducting
substantive IEP reviews. Some states have implemented centralized IEP warehouses with
Considerations for an AA-MAS
Page 79
Web-based interfaces. These systems ease the monitoring process in two ways. IEP contents
are more standardized (or at least organized the same way across districts), and the central
location allows for easier access to facilitate reviews.
Some guidance is emerging on how states might monitor compliance with specific
elements of the regulations. For example, Burdette (2009) conducted surveys and follow-up
interviews with state education agencies regarding their procedures for monitoring IEPs for
evidence of student progress toward annual goals. Another option would be to adapt a
framework like the one Karger (2004) proposed for evaluating how IEPs reflect a student‘s
access to the general curriculum. If a state chose this type of systematic review of the
substantive elements, a consistently-applied content analysis procedure would be established.
Rater training and calibration would be needed to support the quality of judgments for highinference ratings.
Validity Evidence
This chapter has addressed standards-based IEPs as vehicles for designing effective
curriculum and instruction in order to help AA-MAS-eligible students access grade-level content
and make progress toward grade-level proficiency. This section describes how instructional
issues contribute to the validity argument for AA-MAS.
The clearest link between instruction and assessment validity is opportunity to learn
(OTL). If scores on the AA-MAS are intended to reflect modified achievement of grade-level
content standards, one premise is that students have had the opportunity to learn the content
that is assessed. Marion (chapter 9, this volume) offers two sample theories of action that
illustrate how states might articulate the role of instruction in their validity arguments. One
premise in his first example is that ―teachers provide instruction that is aligned with these high
academic expectations and ensure that students get the supports necessary to allow them to
succeed with grade-level content‖ (pp. 320–321). If this statement was supported by evidence,
Considerations for an AA-MAS
Page 80
low scores would reflect limited achievement in the target domain, rather than lack of instruction
in that domain.
For students who take AA-MAS, OTL requires instruction in the targeted content domain,
with supports that removes barriers related to the disability. OTL is a matter of degree rather
than an absolute dichotomy, and can be difficult to measure (AERA, APA, & NCME, 1999,
Standard 13.5). Alignment studies provide one way of assessing correspondence between
curriculum and assessment. Teacher-reported curriculum measures such as Surveys of
Enacted Curriculum (Porter & Smithson, 2001) or the Curriculum Indicators Survey (Karvonen,
Wakeman, Flowers, & Browder, 2007) are a way of incorporating curricular data in alignment
studies.
IEPs also offer a source of evidence related to OTL, and can be used for other validity
investigations as well. Content analysis can yield data such as:
Curricular priorities reflected in the IEP, which can then be evaluated against grade-level
content standards to document instruction in the intended content. Curricular priorities
may also be compared with AA-MAS content, although that type of comparison does not
yield strong ―alignment‖ evidence.
The quality of the instruction program for providing general curriculum access through:
o
Links between present levels of performance and annual goals
o
Correspondence between accommodations provided for instruction & assessment
o
Evaluative judgments about criteria for performance specified in the academic goals,
and whether the expected growth from present level of performance to end of year
goal reflects reasonable (and high) expectations
o
Appropriate use of other learning supports to promote meaningful access and
remove barriers (e.g., ESY services, behavior intervention plans)
Evidence of appropriate decisions made about student participation in AA-MAS.
Especially using retrospective analysis of multiple IEPs, or the IEP decision support tools
Considerations for an AA-MAS
Page 81
described in the eligibility section earlier in this chapter, investigators may determine
whether the student‘s past pattern of instruction really does point to eligibility based on
poor performance despite effective instruction.
Two investigations of this nature are being conducted at the time of this writing, although the
studies have not yet reached the data analysis phase (Karvonen, 2009).
Finally, evidence of the instructional program may be linked to student test scores and
desired student outcomes (AERA, APA, & NCME, 1999, Standard 13.9). While this type of
investigation addresses validity evidence in cases where the score leads to high-stakes
decisions for the individual student (e.g., promotion, graduation), one could argue that the
stakes for AA-MAS-eligible students are high every year if they are to reach grade-level
proficiency in the future. The comment on Standard 13.9 relative to special education is as
follows:
…when test scores are used in the development of specific educational objectives
and instructional strategies, evidence is needed to show that the prescribed
instruction enhances students‘ learning. When there is limited evidence about the
relationship among test results, instructional plans, and student achievement
outcomes, test developers and users should stress the tentative nature of test-based
recommendations…(p. 147)
Thus, if a state chooses a model in which performance on an AA-MAS leads
automatically to a decision about provision of different services or a change in test eligibility, it
would be important to know whether the educational program was designed to promote learning
in the content that was assessed.
Conclusion
The federal guidance on AA-MAS places a greater responsibility for results on the
quality of instruction compared with other types of alternate assessment. Fortunately, the role of
Considerations for an AA-MAS
Page 82
the IEP has become more central and is no longer seen as a compliance document (Sopko,
2003). A good, standards-based IEP can guide educators to support meaningful learning so
students can work toward grade-level proficiency. States have noted benefits to using a
standards-based approach to IEPs including higher levels of student achievement, integration of
special and general education, and benefits to parents who participate in the IEP process
(Ahearn, 2006). Aside from representing the concept of opportunity to learn, there is some
evidence that having IEPs aligned with state academic content standards makes a difference for
instruction (McLaughlin, Nolet, Rhim, & Henderson, 1999) and achievement (Karvonen &
Huynh, 2007).
Challenges and Caveats
This chapter has presented suggestions for IEP teams to set curricular priorities, build
services and supports into the educational program, and guide teachers in developing effective
instruction. The chapter also offers suggestions to states on providing guidance and support to
IEP teams in order to promote highly-effective practices for students who are not achieving at
grade level. These recommendations are intended to help the field move forward toward optimal
practices that help students meet high expectations. In some areas, educators may be prepared
to enact these strategies immediately. For example, there is an extensive body of practice on
how to design and implement formative assessment. In other areas, there is still a significant
gap between current, realistic practice and the ideal. These gaps exist at the state, local, and
school/teacher levels.
States have many competing priorities and will need to decide what is pragmatic in the
systems and supports they establish. State and local agencies may begin to set paths toward
ideal practice by evaluating where they are and how to reach the next logical step. For example,
existing IEP monitoring systems could be reviewed in light of the AA-MAS regulations to
evaluate what elements of the system would need to be changed in order to meet federal
Considerations for an AA-MAS
Page 83
requirements. In order to move the system forward, state stakeholders could define ―ideal‖
practice as it relates to IEP content. They could then develop and pilot that system on a limited
basis, using early data for formative purposes before scaling up.
The largest challenges at the school level lie in the lack of research—both on cognitive
models of learning and content progressions for this population, and on effective instructional
practices. Where there is a longer history of research (e.g., CBM practices), it may be
challenging to distill and disseminate these strategies to teachers in ways that promote change
in their practices. When resources are limited, instructional leaders at the state and local level
may have less time to analyze promising new strategies that are worth disseminating to their
teachers. Even when strong evidence exists for particular strategies, and that evidence is
shared with teachers, there is still the conundrum of interpreting the AA-MAS requirement that
students be taught in grade-level content standards even when they may not have mastered
prerequisite or foundational content. Local agencies may wish to weigh the potential benefits of
focusing their professional development on a broad view for designing and using IEPs in
effective instructional planning versus professional development on separate topics related to
good instruction (e.g., detailed support on translating academic content, how to collect formative
data, how to choose accommodations). Predominant needs may vary by district or school.
As teachers develop IEPs that are consistent with federal mandates for AA-MAS, it is
important to remember that academics are part of the broader educational program.
Instructional time is finite, and there are many competing priorities for these students (Thurlow,
2008). Where possible, teachers will need to think creatively about how to capitalize on
relationships with other parts of the program (e.g., academics embedded in therapeutic or
transition goals; a focus on learning strategies or academic goals that support growth across the
curriculum) and provide instruction that combines academics with other values for the
population. Without these combinations, the risk is that instruction will become more
Considerations for an AA-MAS
Page 84
fragmented. This fragmentation may be detrimental to the goal of moving the student toward
being prepared for assessment against grade-level achievement standards.
The final AA-MAS regulations emphasize the need for this target population of students
to access grade-level content standards rather than modified content standards in order to
maintain high expectations for student learning (USED, 2007b, p. 17,755). There are still many
questions left unanswered about how students can access grade-level content, make more
progress than they have made previously, and have their achievement measured against
modified standards using assessments that allow for valid inferences about what they know and
can do. If students who are eligible for AA-MAS remain identified as persistently low-performing
over several years, what are the consequences for those students in the future? Are they
permanently disadvantaged? Although participation in AA-MAS is not supposed to preclude
participation in requirements for a high school diploma [34 CFR 200.1(f)(2)(iv)], what are the
long-term consequences if the educational program does not help students make up for earlier
learning deficits? Aside from the social justice implications, what are the economic costs for
failing to help students be adequately prepared for successful completion of high school
graduation requirements (Levin, 2009)? As described by Quenemoen (Chapter 2, this volume),
the population of low-performing students also includes many students without disabilities. If
educators successfully design and implement effective instructional programs for struggling
students with disabilities, they also have the potential to lead the way on instruction for all
students who are persistently low-performing — not just those with disabilities. Individual growth
or learning plans would allow teachers to translate the ideas behind the IEP as a means of
instructional planning to the rest of the population of low-performing students, in order to
promote success for all struggling students.
Considerations for an AA-MAS
Page 85
References
Access Center. (2004). Aligning IEPs with state standards and accountability systems.
Washington, DC: American Institutes for Research. Retrieved April 22, 2009 from
http://www.k8accesscenter.org/training_resources/aligningieps.asp
Access Center. (2006). Teaching matters: The link between access to the general education
curriculum and performance on state assessments. Washington, DC: American
Institutes for Research. Retrieved April 22, 2009 from
www.k8accesscenter.org/documents/TeachingMattersBrief_001.pdf
Ahearn, E. (2006, May). Standards-based IEPs: Implementation in Selected States. Alexandria:
Project Forum, National Association of State Directors of Special Education (NASDE).
Retrieved April 21, 2009 from http://www.projectforum.org/docs/Standards-BasedIEPsImplementationinSelectedStates.pdf
Almond, P., & Karvonen, M. (2007). Accommodations for a K to 12 standardized assessment:
Practical implications for policy. In C. Cahalan Laitusis & L. L. Cook (Eds.), Large-scale
assessment and accommodations: What works? (pp. 117-136). Arlington, VA: Council
for Exceptional Children.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education (1999). Standards for educational and
psychological testing. Washington, DC: AERA.
Bateman, B. D., & Herr, C. M. (2006). Writing measurable IEP goals and objectives (2nd ed.).
Verona, WI: Attainment Company.
Browder, D. M., Spooner, F., Wakeman, S., Trela, K., & Baker, J. N. (2006). Aligning instruction
with academic content standards: Finding the link. Research and Practice for Persons
with Severe Disabilities, 31(4), 309-321.
Burdette, P. (2009, February). State tracking to measure student progress toward IEP goals.
Alexandria, VA: National Association of State Directors of Special Education. Retrieved
March 30, 2009 from
http://www.projectforum.org/docs/StateTrackingtoMeasureStudentProgressTowardIEPG
oals.pdf
Byrnes, M. (2008). Educators‘ interpretations of ambiguous accommodations. Remedial and
Special Education, 29(5), 306-315.
California Department of Education (2008, December 18). Educational benefit activity.
Sacramento, CA: Author.
CAST (2008). Universal design for learning (UDL) guidelines - Version 1.0. Wakefield, MA:
Author. Retrieved April 21, 2009 from
http://www.cast.org/publications/UDLguidelines/version1.html
Cortiella, C. (2007). Learning opportunities for your child through alternate assessments:
Alternate assessments based on modified achievement standards. Minneapolis, MN:
Considerations for an AA-MAS
Page 86
University of Minnesota, National Center on Educational Outcomes. Retrieved April 21,
2009 from http://cehd.umn.edu/nceo/OnlinePubs/AAMASParentGuide.pdf
Courtade-Little, G., & Browder, D. (2005). Aligning IEPs to academic standards for students with
moderate and severe disabilities. Verona, WI: IEP Resources.
Deno, S. L. (2003). Developments in curriculum-based measurement. Journal of Special
Education, 37(3), 184-192.
Gibson, V. & Hasbrouck, J. (2008). Differentiated instruction: Grouping for success. Boston:
McGraw-Hill.
Hardman, M. L. & Dawson, S. (2008). The impact of federal public policy on curriculum and
instruction for students with disabilities in the general classroom. Preventing School
Failure, 52(2), 5 – 11.
Hess, K. (2008, June). Developing and using learning progressions as a schema for measuring
progress. Paper presented at 2008 CCSSO National Conference on Student
Assessment, Orlando, FL. Retrieved March 31, 2009 from http://www.nciea.org/cgibin/pubspage.cgi?sortby=pub_date
Hill, W., & Erwin, R. (1984). The readability of content textbooks used in middle and junior high
schools. Reading Psychology, 5(1), 105-117.
Individuals with Disabilities Education Improvement Act of 2004, P. L. No. 108-446, 20 U.S.C.
section 611-614.
Jenkins, J. R., Graff, J. J., & Miglioretti, D. L. (2009). Estimating reading growth using
intermittent CBM progress monitoring. Exceptional Children, 75(2), 151-163.
Jitendra, A. K., Nolet, V., Xin, Y. P., Gomez, O., Renouf, K., Iskold, L., & DaCosta, J. (2001). An
analysis of middle school geography textbooks: Implications for students with learning
problems. Reading & Writing Quarterly, 17(2), 151-173.
Joyce, B. R., & Showers, B. (2002). Student achievement through staff development (3rd ed.).
Alexandria, VA: Association for Supervision and Curriculum Development.
Karger, J. (2004). Access to the general education curriculum for students with disabilities: The
role of the IEP. Wakefield, MA: National Center on Accessing the General Curriculum.
Retrieved from http://www.cast.org/publications/ncac/ncac_iep.html.
Karvonen, M. (2009). IEP content analysis protocols. Cullowhee, NC: Western Carolina
University.
Karvonen, M., & Huynh, H. (2007). The relationship between IEP characteristics and test scores
on alternate assessments for students with significant cognitive disabilities. Applied
Measurement in Education, 20, 273-300.
Karvonen, M., Wakeman, S. L., Flowers, C. P., & Browder, D. M. (2007). Measuring the enacted
curriculum for students with significant cognitive disabilities: A preliminary investigation.
Assessment for Effective Intervention, 33(1), 29-38.
Considerations for an AA-MAS
Page 87
Kopriva, R., Koran, J., & Hedgspeth, C. (2007). Addressing the importance of systematically
matching student needs and test accommodations. In C. Cahalan Laitusis & L. L. Cook
(Eds.), Large-scale assessment and accommodations: What works? (pp. 145-165).
Arlington, VA: Council for Exceptional Children.
Lazarus, S. S., Rogers, C., Cormier, D., & Thurlow, M. L. (2008). States’ participation guidelines
for alternate assessments based on alternate achievement standards (AA-MAS) in 2008
(Synthesis Report 71). Minneapolis, MN: National Center on Educational Outcomes.
Levin, H. M. (2009). The economic payoff to investing in educational justice. Educational
Researcher, 38(1), 5-20. doi: 10.3102/0013189X08331192
McLaughlin, M. J., Nolet, V., Rhim, L. M., & Henderson, K. (1999). Integrating standards:
Including all students. Teaching Exceptional Children, 31(3), 66-71.
NASDE (2007). A seven-step process for creating standards-based IEPs. Available at:
http://cehd.umn.edu/nceo/Teleconferences/AAMASteleconferences/
SevenStepProcess.pdf
National Center on Educational Outcomes [NCEO]. (2007, May 17). Teleconference notes.
Standards-based IEPs and IEP goals based on grade-level standards. Available at:
http://www.education.umn.edu/NCEO/USED2percentTele051707.pdf.
National Center on Response to Intervention (n.d.) What is RTI? Retrieved April 21, 2009 from
http://www.rti4success.org/
National Council of Teachers of Mathematics. (2000). Principles and standards for school
mathematics. Reston, VA: Author.
National Reading Panel. (2000). Report of the National Reading Panel, teaching children to
read: An evidence-based assessment of the scientific research literature on reading and
its implications for reading instruction. Washington, DC: National Institute for Literacy.
New York State Education Department (2008, December 23). Special education quality
assurance: IDEA effective instructional practices focused review manual. Albany, NY:
New York State Education Department, Office of Vocational & Educational Services for
Individuals with Disabilities.
Nitko, A. J. (2004) Formative evaluation using informal diagnostic assessments. In Educational
assessment of students (4th ed., pp. 288-303). Upper Saddle River, NJ: Pearson
Education.
Perie,M. (2009, February). Understanding the AA-MAS: How does it fit into a state assessment
and accountability system? Presentation to the SCASS groups, Orlando, FL. Retrieved
March 26, 2009 from http://www.nciea.org/publications/CrossSCASS_MAP09.pdf
Porter, A. C., & Smithson, J. L. (2001, December). Defining, developing, and using curriculum
indicators (CPRE Research Report Series RR-048). Retrieved April 22, 2009 from
http://www.cpre.org/images/stories/cpre_pdfs/rr48.pdf
Considerations for an AA-MAS
Page 88
Quenemoen, R. (2009). Success one student at a time: What the IEP team does. Minneapolis,
MN: University of Minnesota, National Center on Educational Outcomes. Retrieved April
20, 2009 from http://www.nceo.info/Tools/StandardsIEPtool.pdf
Quenemoen, R., Thurlow, M., Moen, R., Thompson, S. & Morse, A. B. (2004). Progress
monitoring in an inclusive standards-based assessment and accountability system
(Synthesis Report 53). Minneapolis, MN: University of Minnesota, National Center on
Educational Outcomes. Retrieved April 20, 2009 from
http://education.umn.edu/NCEO/OnlinePubs/Synthesis53.html
Shriner, J., & Destefano, L. (2003). Participation and accommodation in state assessment: The
role of Individualized Education Programs. Exceptional Children, 69(2), 147-161.
Sopko, K. M. (2003). The IEP: A synthesis of current literature since 1997. Alexandria, VA:
National Association of State Directors of Special Education, Project FORUM. Retrieved
April 21, 2009 from http://www.projectforum.org/docs/iep.pdf
Soukup, J. H., Wehmeyer, M. L., Bashinski, S., & Bovaird, J. A. (2007). Classroom variables
and access to the general curriculum for students with disabilities. Exceptional Children,
74(1), 101-120.
Stecker, P. M. (n.d.). Monitoring student progress in Individualized Educational Programs using
curriculum-based measurement. Washington, DC: National Center on Student Progress
Monitoring. Retrieved April 21, 2009 from
http://www.studentprogress.org/library/monitoring_student_progress_in_individualized_e
ducational_programs_using_cbm.pdf.
Stecker, P. M., Fuchs, L. S., & Fuchs, D. (2005). Using curriculum-based measurement to
improve student achievement: Review of research. Psychology in the Schools, 42(8),
795-819. doi: 10.1002/pitts.20113
Thurlow, M. L. (2008). Assessment and instructional implications of the alternate assessment
based on modified academic achievement standards (AA-MAS). Journal of Disability
Policy Studies, 19(3), 132-139. doi: 10.1177/1044207308327473
Thurlow, M., L., Christensen, L. L., & Lail, K. E. (2008). An analysis of accommodations issues
from the standards and assessments peer review (Technical Report 51). Minneapolis,
MN: University of Minnesota, National Center on Educational Outcomes. Retrieved April
20, 2009 from http://cehd.umn.edu/NCEO/OnlinePubs/Tech51/index.htm
U.S. Department of Education (2007a, April 9). Final Rule 34 CFR Parts 200 and 300: Title I—
Improving the Academic Achievement of the Disadvantaged; Individuals with Disabilities
Education Act (IDEA). Federal Register. 72(67), Washington DC: Author. Available at
http://www.ed.gov/admins/lead/account/saa.html#regulations
U.S. Department of Education (2007b, July 20), Modified Academic Achievement Standards:
Non-regulatory Guidance. Washington, DC: Office of Elementary and Secondary
Education, U.S. Department of Education. Available at:
http://www.ed.gov/admins/lead/account/saa.html#regulations.
Considerations for an AA-MAS
Page 89
CHAPTER 4
THE CHALLENGES OF CONCEPTUALIZING WHAT LOW ACHIEVERS KNOW AND
HOW TO ASSESS THEIR COMPETENCE
Jim Pellegrino
This chapter considers some of the most important issues surrounding the ―what‖ and
―how‖ of assessment as applied to any population of students, but especially those students
who fall in the range of low academic achievement as measured by the typical achievement
tests used for purposes of NCLB accountability. It is intended as a bridge between the prior two
chapters, with their focus on identification of those students whose academic performance is
such that assessment relative to modified achievement standards may be appropriate, and the
next section of this report with its focus on issues regarding the content and design of any such
assessment.
The first section of this chapter is concerned with two ―big ideas‖ that are essential to the
development, implementation and use of assessments for any group of students: (1)
understanding relationships among assessment, curriculum and instruction, and (2)
conceptualizing assessment as a process of reasoning from evidence that should be driven by
theories and data on student cognition. The second section then elaborates on one key element
from section one — those aspects of cognition that underlie student knowledge and
performance in subject matter domains and that must be considered in the design and
interpretation of student assessments. The third section builds from the prior two. It focuses on
important aspects of student cognition and provides examples of their development in the
instructional domains of reading and mathematics. The fourth section considers the implications
of the preceding sections for multiple facets in the design of valid and useful student
assessments. The final section concludes by considering the validity of assessment practices
that might be considered for this population of students in light of the content of preceding
Considerations for an AA-MAS
Page 90
sections. It also suggests some things we still need to know to make progress in the design of
valid assessments based on modified achievement standards.
Two Critical Issues for Conceptualizing Student Assessment
The Curriculum-Instruction-Assessment Triad
Whether we recognize it or not, assessment does not and should not stand alone.
Rather, it is one of three central components in the educational enterprise — curriculum,
instruction, and assessment. The three elements of this triad are linked, although the nature of
their linkages and reciprocal influence is often less explicit than it should be. Furthermore, the
separate pairs of connections are often inconsistent which can lead to an overall incoherence in
the educational enterprise.
Curriculum consists of the knowledge and skills in subject matter areas that teachers
teach and students are supposed to learn. The curriculum generally consists of a scope or
breadth of content in a given subject area and a sequence for learning (Pugalee & Rickelman,
Chapter 5, this volume). Content standards in each subject matter area typically outline the
goals of learning, whereas curriculum sets forth the more specific means to be used to achieve
those ends. Instruction refers to methods of teaching and the learning activities used to help
students master the content and objectives specified by a curriculum. Instruction encompasses
the activities of both teachers and students. It can be carried out by a variety of methods,
sequences of activities, and topic orders. Assessment is the means used to measure the
outcomes of education and the achievement of students with regard to important competencies.
Assessment may include both formal methods, such as large-scale state assessments, or less
formal classroom-based procedures, such as quizzes, class projects, and teacher questioning.
A precept of educational practice is the need for alignment among curriculum,
instruction, and assessment (e.g., NCTM, 1995, 2000; Webb, 1997). Alignment, in this sense,
means that the three functions are directed toward the same ends and reinforce each other
Considerations for an AA-MAS
Page 91
rather than working at cross-purposes. Ideally, an assessment should measure what students
are actually being taught, and what is actually being taught should parallel the curriculum one
wants students to master. If any of the functions is not well synchronized, it will disrupt the
balance and skew the educational process. Assessment results will be misleading, or instruction
will be ineffective. Alignment is difficult to achieve, however. Often what is lacking is a central
theory about the nature of learning and knowing around which the three functions can be
coordinated.
Most current approaches to curriculum, instruction, and assessment are based on
theories and models that have not kept pace with modern knowledge of cognition and how
people learn (e.g., NRC, 1999a, 1999b, 1999c, 2000, 2001a; Shepard, 2000). They have been
designed on the basis of implicit and highly limited conceptions of cognition and learning. Those
conceptions tend to be fragmented, outdated, and poorly delineated for domains of subject
matter knowledge. Alignment among curriculum, instruction, and assessment could be better
achieved if all three are derived from a scientifically credible and shared knowledge base about
cognition and learning in subject matter domains. The model of learning would provide the
central bonding principle, serving as a nucleus around which the three functions would revolve.
Without such a central core, and under pressure to prepare students for the accountability tests,
teachers may feel compelled to move back and forth between instruction and external
assessment and teach directly to the items on a state test. This approach can result in an
undesirable narrowing of the curriculum and a limiting of learning outcomes. Such problems can
be ameliorated if, instead, decisions about both instruction and assessment are guided by a
model of learning in the domain that represents the best available scientific understanding of
how people learn (NRC, 2000).
Considerations for an AA-MAS
Page 92
Assessment as a Process of Reasoning from Evidence
Educators assess students to learn about what they know and can do, but assessments
do not offer a direct pipeline into a student‘s mind. Assessing educational outcomes is not as
straightforward as measuring height or weight; the attributes to be measured are mental
representations and processes that are not outwardly visible. Thus, an assessment is a tool
designed to observe students‘ behavior and produce data that can be used to draw reasonable
inferences about what students know. Deciding what to assess and how to do so is not as
simple as it might appear.
The process of collecting evidence to support inferences about what students know
represents a chain of reasoning from evidence about student learning that characterizes all
assessments, from classroom quizzes and standardized achievement tests, to computerized
tutoring programs, to the conversation a student has with her teacher as they work through a
math problem or discuss the meaning of a text. In the 2001 report Knowing What Students
Know: The Science and Design of Educational Assessment issued by the National Research
Council, the process of reasoning from evidence was portrayed as a triad of three
interconnected elements — the assessment triangle (NRC, 2001a). The vertices of the
assessment triangle represent the three key elements underlying any assessment: a model of
student cognition and learning in the domain of the assessment; a set of beliefs about the kinds
of observations that will provide evidence of students‘ competencies; and an interpretation
process for making sense of the evidence. These three elements may be explicit or implicit, but
an assessment cannot be designed and implemented without some consideration of each. The
three are represented as vertices of a triangle because each is connected to and dependent on
the other two. A major tenet of the Knowing What Students Know report is that for an
assessment to be effective and valid, the three elements must be in synchrony. The
assessment triangle provides a useful framework for analyzing the underpinnings of current
Considerations for an AA-MAS
Page 93
assessments to determine how well they accomplish the goals we have in mind, as well as for
designing future assessments.
The cognition corner of the triangle refers to a theory or set of beliefs about how
students represent knowledge and develop competence in a subject domain (e.g., fractions). In
any particular assessment application, a theory of learning in the domain is needed to identify
the set of knowledge and skills that is important to measure for the task at hand, whether that be
characterizing the competencies students have acquired thus far or guiding instruction to further
increase learning. A central premise is that the cognitive theory should represent the most
scientifically credible understanding of typical ways in which learners represent knowledge and
develop expertise in a domain. More will be said in the next section about what we know about
the nature of cognition and the development of subject matter competence.
Every assessment is also based on a set of beliefs about the kinds of tasks or situations
that will prompt students to say, do, or create something that demonstrates important
knowledge and skills. The tasks to which students are asked to respond on an assessment are
not arbitrary. They must be carefully designed to provide evidence that is linked to the cognitive
model of learning and to support the kinds of inferences and decisions that will be made on the
basis of the assessment results. The observation vertex of the assessment triangle represents a
description or set of specifications for assessment tasks that will elicit illuminating responses
from students. In assessment, one has the opportunity to structure some small corner of the
world to make observations. The assessment designer can use this capability to maximize the
value of the data collected, as seen through the lens of the underlying beliefs about how
students learn in the domain.
Every assessment is also based on certain assumptions and models for interpreting the
evidence collected from observations. The interpretation vertex of the triangle encompasses all
the methods and tools used to reason from fallible observations. It expresses how the
observations derived from a set of assessment tasks constitute evidence about the knowledge
Considerations for an AA-MAS
Page 94
and skills being assessed. In the context of large-scale assessment, the interpretation method is
usually a statistical model, which is a characterization or summarization of patterns one would
expect to see in the data given varying levels of student competency. In the context of
classroom assessment, the interpretation is often made less formally by the teacher, and is
usually based on an intuitive or qualitative model rather than a formal statistical one.
A crucial point is that each of the three elements of the assessment triangle not only
must make sense on its own, but also must connect to each of the other two elements in a
meaningful way to lead to an effective assessment and sound inferences. Thus to have an
effective assessment, all three vertices of the triangle must work together in synchrony. Central
to this entire process, however, are theories and data on how students learn and what students
know as they develop competence in aspects of the curriculum.
Fundamental Components of Cognition and Some Implications for Assessment
This section begins the process of specifying some of what we know about the nature of
human cognition that has implications for instruction, learning, and assessment. We begin at the
level of what is general about the human cognitive system in terms of how it processes
information and the types of knowledge and skill that are developed over time through
instruction and practice. This ―generic‖ description is meant to apply to the minds of virtually all
individuals with the possible exception of those with the most severe cognitive impairments. The
utility of such a generic description is that it provides a basis for considering some of the
sources of differences in learning and performance that might be associated with persistently
low levels of academic achievement. We consider such implications at the end of this section
before moving to Section III where we examine some of what we know about specific aspects of
cognition and learning in the domains of reading and mathematics and what that might imply for
assessment.
Considerations for an AA-MAS
Page 95
Working Memory
One of the chief theoretical advances to emerge from cognitive research is the notion of
cognitive architecture--the information processing system that determines the flow of information
and how it is acquired, stored, represented, revised, and accessed in the mind. One of the most
critical components of the cognitive architecture is working memory. It has been conceptualized
as the system we use to process and act on information that is immediately before us and that
we are consciously processing (Baddeley, 1986). Rather than viewing working memory as a
―place‖ in the cognitive system, contemporary theoretical work has conceptualized working
memory as a kind of cognitive energy level or ―resource‖ that exists in limited amounts, with
substantial individual variations. It is a well established fact that there are reliable developmental
and individual differences in working memory capacity that predict a range of cognitive
outcomes including scores on conventional tests of intelligence and achievement (e.g.,
Unsworth & Engle, 2007).
A significant aspect of the construct of working memory is that it plays a central role in
virtually any cognitive activity we can imagine, determining the success or failure of many if not
most of our intellectual endeavors. The range of activities that are impacted by the capacity and
efficiency of one‘s working memory includes such things as executing a procedure like
multicolumn addition or subtraction while monitoring the products of the process and the
sequential steps; the act of representing and learning a new procedure like learning to ―borrow
across zero‖ in multicolumn subtraction; and the process of reading and comprehending a piece
of narrative or expository text, including activities such as resolving issues of reference and
making inferences (e.g., Miyake, Just, and Carpenter, 1994).
Metacognition
The term metacognition (literally ―thinking about thinking‖) is commonly used to refer to
the selection and monitoring processes, as well as to more general activities of reflecting on and
Considerations for an AA-MAS
Page 96
directing one‘s own thinking. Good learners have strong metacognitive skills (Hatano, 1990).
They monitor their problem-solving, question limitations in their knowledge, and avoid
oversimplistic interpretations of a problem. In the course of learning and problem-solving, such
individuals display certain kinds of regulatory performance such as knowing when to apply a
procedure or rule, predicting the correctness or outcomes of an action, planning ahead, and
efficiently apportioning cognitive resources and time.
There is ample evidence that metacognition develops over the school years; for
example, older children are better than younger ones at planning for tasks they are asked to do.
Metacognitive skills can also be taught. For example, people can learn mental devices that help
them stay on task, monitor their own progress, reflect on their strengths and weaknesses, and
self-correct errors. It is important to note, however, that the teaching of metacognitive skills is
often best accomplished in specific content areas since the ability to monitor one‘s
understanding is closely tied to domain-specific knowledge and expertise.
Types of Knowledge and Processes of Acquisition
Long-term memory contains two distinct types of information — information about ―the
way the world is‖ (declarative knowledge) and procedural information about ―how things are
done‖ (procedural knowledge). It is one thing to know what it means to throw a 90-mile-per-hour
fastball for a strike in baseball and quite another to be able to actually do it! Knowing about
something (making a soufflé) is not the same as actually being able to do that thing. Much of
what we would like students to learn in school is a combination of both declarative and
procedural knowledge and for both types of knowledge we want them to access and use that
knowledge in a highly fluent and relatively automatic fashion.
Unlike working memory, long-term memory is, for all practical purposes, an effectively
limitless store of information. What matters most in learning situations is not the capacity of
working memory—although that is often a factor in the speed and/or accuracy of processing—
Considerations for an AA-MAS
Page 97
but how well one can evoke the knowledge stored in long-term memory and use it to reason
efficiently about information and solve problems in the present.
As part of studying the nature of knowledge in long-term memory, researchers have probed
deeply the nature of competence and how people acquire large bodies of knowledge over long
periods of time. Studies have revealed much about the kinds of mental structures that support
problem-solving and learning in various domains; what it means to develop competence in a
domain; and how the thinking of high achievers differs from that of novices (e.g., Chi, Feltovich, &
Glaser, 1981). What distinguishes high from low performers is not simply general mental abilities,
such as memory or fluid intelligence, or general problem-solving strategies. High performers have
acquired extensive stores of knowledge and skill in a particular domain. But perhaps most
significant, their minds have organized this knowledge in ways that make it more retrievable and
useful. Because their knowledge has been encoded in a way that closely links it with the contexts
and conditions for its use, high achievers do not have to search through the vast repertoire of
everything they know when confronted with a problem. Instead, they can readily activate and
retrieve the subset of their knowledge that is relevant to the task at hand (Simon, 1980; Glaser,
1992). Such findings suggest that teachers should place more emphasis on the conditions for
applying the facts or procedures being taught, and that assessment should address whether
students know when, where, and how to use their knowledge.
Considerable effort has also been expended on understanding the characteristics of
persons and of the learning situations they encounter that foster the development of expertise.
Much of what we know about the development of expertise has come from studies of children
as they acquire competence in many areas of intellectual endeavor, including the learning of
school subject matter. (This is further discussed in the section titled ―Domain Specific Aspects of
Cognition and Learning.‖) From a cognitive standpoint, development and learning are not the
same thing. Some types of knowledge are universally acquired in the course of typical
development, while other types are learned only with the intervention of deliberate teaching
Considerations for an AA-MAS
Page 98
(which includes teaching by any means, such as apprenticeship, formal schooling, or selfstudy). Infants and young children appear to be predisposed to learn rapidly and readily in some
domains, including language, number, and notions of physical and biological causality. Infants
who are only 3 or 4 months old, for example, have been shown to understand certain concepts
about the physical world, such as the idea that inanimate objects need to be propelled in order
to move (Massey & Gelman, 1988). By the time children are 3 or 4 years old, they have an
implicit understanding of certain rudimentary principles for counting, adding, and subtracting
cardinal numbers (Gelman, 1990; Gelman & Gallistel, 1978).
In math, the fundamentals of ordinality and cardinality appear to develop in all nondisabled human infants without instruction. In contrast, however, such concepts as
mathematical notation, algebra, and Cartesian graphing representations must be taught.
Similarly, the basics of speech and language comprehension emerge naturally from millions of
years of evolution, whereas mastery of the alphabetic code necessary for reading typically
requires explicit instruction and long periods of practice (Geary, 1995). Much of what we want to
assess in educational contexts is the product of such deliberate learning.
With respect to assessment, one of the most important findings from detailed
observations of children‘s learning behavior is that children do not move simply and directly from
an erroneous to an optimal solution strategy (Kaiser, Proffitt, and McCloskey, 1985). Instead,
they may exhibit several different but locally or partially correct strategies (Fay and Klahr, 1996).
They also may use less advanced strategies even after demonstrating that they know more
advanced ones, and the process of acquiring and consolidating robust and efficient strategies
may be quite protracted, extending across many weeks and hundreds of problems (Siegler,
1998). These studies have also found, moreover, that short-term transition strategies often
precede more lasting approaches and that generalization of new approaches often occurs very
slowly.
Considerations for an AA-MAS
Page 99
The Role of Practice and Feedback
Every domain of knowledge and skill has its own body of concepts, factual content,
procedures, and other items that together constitute the knowledge of that field. In many
domains, including areas of mathematics and science, this knowledge is complex and
multifaceted, requiring sustained effort and focused instruction to master. Developing deep
knowledge of a domain such as that exhibited by high achievers, along with conditions for its
use, takes time and focus and requires opportunities for practice with feedback.
Whether considering the acquisition of some highly specific piece of knowledge or skill
such as the process of adding two numbers, or some larger schema for solving a mathematics
or physics problem, certain laws of skill acquisition always apply. The first of these is the power
law of practice: acquiring skill takes time, often requiring hundreds or thousands of instances of
practice in retrieving a piece of information or executing a procedure. This law operates across
a broad range of tasks, from typing on a keyboard to solving geometry problems (Anderson,
1981; Rosenbloom & Newell, 1987). According to the power law of practice, the speed and
accuracy of performing a simple or complex cognitive operation increases in a systematic
nonlinear fashion over successive attempts. This pattern is characterized by an initial rapid
improvement in performance, followed by subsequent and continuous improvements that
accrue at a slower and slower rate.
Practice, however, is not enough to ensure that a skill will be acquired. The conditions
of practice are also important. The second major law of skill acquisition involves knowledge of
results. Individuals acquire a skill much more rapidly if they receive feedback about the
correctness of what they have done. If incorrect, they need to know the nature of their mistake
(Thorndike, 1931). One of the persistent dilemmas in education is that students often spend
time practicing incorrect skills with little or no feedback. Furthermore, the feedback they
ultimately receive is often neither timely nor informative. Unguided practice (e.g., homework in
math) can be for the less able student, practice in doing tasks incorrectly. One of the most
Considerations for an AA-MAS
Page 100
important roles for assessment is the provision of timely and informative feedback to students
during instruction and learning so that their practice of a skill and its subsequent acquisition will
be effective and efficient (Black & Wiliam, 1998; Sadler, 1989; Wiliam, 2007).
The Role of Social Context, Cultural Norms, and Student Beliefs
Much of what humans learn is acquired through discourse and interactions with others.
For example, science, mathematics, and other domains are often shaped by collaborative work
among peers. Through such interactions, individuals build communities of practice, test their
own theories, and build on the learning of others. For example, those who are still using a naive
strategy can learn by observing others who have figured out a more productive one. This
situation contrasts with many school situations, in which students are often required to work
independently or even competitively. Yet the display and modeling of cognitive competence
through group participation and social interaction is an important mechanism for the
internalization of knowledge and skill in individuals (Rogoff, 1990). Studies suggest that much
of knowledge is also highly ―situated‖—it is embedded within systems of representation,
discourse, and physical activity. A part of developing competence is learning to participate in
communities of practice which in turn serve as sites for developing identity as a member of
various communities and what happens in those communities as enabled by a variety of
artifacts and tools (Lave, 1988).
The beliefs students hold about learning are another social dimension that can
significantly affect learning and performance (e.g., Dweck & Legitt, 1988). For example, many
students believe, on the basis of their typical classroom and homework assignments, that any
math problem can be solved in 5 minutes or less, and if they cannot find a solution in that time,
they will give up. Many young people and adults also believe that talent in mathematics and
science is innate, which gives them little incentive to persist if they do not understand something
Considerations for an AA-MAS
Page 101
in these subjects immediately. Conversely, people who believe they are capable of making
sense of unfamiliar things often succeed because they invest more sustained effort in doing so.
If mathematics is presented by a teacher as a set of rules to be applied, students may
come to believe that ―knowing‖ math means remembering which rule to apply when a question
is asked (usually the rule the teacher last demonstrated), and that comprehending the concepts
that underlie the question is too difficult for ordinary students. In contrast, when teachers
structure math lessons so that important principles are apparent as students work through the
procedures, students are more likely to develop deeper understanding and become
independent and thoughtful problem-solvers (Lampert, 1986).
Some Implications for Low-Achieving/performing Students
What are some possible implications of the cognitive architecture and the nature of
knowledge and its development for understanding the performance of low-achieving students?
It would nice if we could provide definitive answers to such a question but in many cases we
lack a research base that allows us to do so. Nevertheless we can speculate on some of the
possible causes of low performance and the implications for both instruction and assessment.
For example, some of the problem may be an information processing bottleneck issue,
especially as regards the capacity of working memory and the management of attentional
resources. Such a bottleneck has implications for the processes of learning and knowledge
acquisition as well as for performance in a testing complex. It may well be the case that the
ability to integrate content and to proceduralize knowledge, which are key aspects of the
process of learning, are slowed or impaired by limitations in basic processing capacities. This is
not to say that individuals cannot acquire the knowledge that is intended but rather that the
speed and conditions needed to do so may differ. Similarly, differences in performance in a
testing situation may have little to do with the availability of the appropriate knowledge but the
load on working memory that taxes the person‘s capacity to manage the situation within the time
Considerations for an AA-MAS
Page 102
demands of the testing situation. Similar issues arise regarding aspects of metacognition and
the capacity to develop and/or exercise such skills in a given learning or performance situation.
Without convincing evidence that it is the architecture per se that contributes to low
achievement, it is reasonable to assume that much of the problem of low achievement
represents a deficit in the nature of the forms of knowledge that are demanded by different
areas of the curriculum. This almost sounds tautological — low achievement by definition
means lack of knowledge. But low achievement may not be associated with a lack of knowledge
per se but a failure to develop the forms of knowledge that are associated with higher levels of
competence and performance. If students perform poorly on tests of domain-specific
achievement it is appropriate to ask how much of the problem may result from a failure of
sufficient opportunity to learn the content required to attain higher levels of competence. In turn,
much of that deficiency might be a function of the failure to make explicit for such students that
which is often tacit in the learning situation and more readily discerned or inferred by nondisabled students. Learning is a process of constructing knowledge and such a constructive
process occurs regardless of the forms of instruction—from guided discovery and hands-on
experiences to collaborative learning to direct instruction to rote memorization. Because
knowledge is constructed rather than delivered there is always a potential gap between what
was intended in the instructional environment and what was actually understood and
represented by the student.
As noted earlier, the development of knowledge is constituted within particular contexts
and situations — an ―interactionist‖ perspective of development (Newcombe & Huttenlocher,
2000). Accordingly, assessment of children‘s development in contexts of schooling should
include attention to the nature of classroom cultures and the practices they promote, as well as
to individual variation. For example, the kinds of expectations established in a classroom for
what counts as a mathematical explanation or what serves as a summarization or interpretation
of a text affect the kinds of strategies and explanations that children pursue and the kinds of
Considerations for an AA-MAS
Page 103
responses they are likely to give in an assessment context. Because knowledge is constructed
from experience, we may need to pay more attention to the nature of the experiences that lowachieving students encounter as the conditions of learning and how those experiences align
with the conditions and expectations for performance in an assessment context. In essence, any
assessment or testing situation is a test of transfer (Ruiz-Primo et al., 2002). What is near
transfer for some students may be far transfer for others given the conditions of learning and the
situated as well as socio-cultural nature of their knowledge (see, e.g., Hickey & Pellegrino,
2005; Pellegrino & Hickey, 2006).
Despite all we know about cognition, we must remind ourselves that there are many
questions yet to answer about the ways in which low-achieving students differ from their regular
education peers and the possible causes as well as consequences. Some of the possible
answers lie in a better understanding of the nature of knowledge and skill in specific curricular
domains and how that develops over time and with instruction.
Domain Specific Aspects of Cognition and Learning
Detailed models of cognition and learning in specific curricular areas can be invaluable
for evaluating the progress of any individual or group, as well as for informing teaching and
learning. In other words, a well-developed and empirically validated model of thinking and
learning in an academic domain can be used to design and select assessment tasks that
support the analysis of various kinds of student performance. Models with power highlight the
main determinants of and obstacles to learning and include descriptions of students‘ conceptual
progressions as they develop competence and expertise. Consistent with these ideas, there has
been a recent spurt of interest in the topic of ―learning progressions‖ (see NRC 2005, 2007).
Learning progressions describe ―successively more sophisticated ways of reasoning within a
content domain that follow one another as students learn‖ (Smith et al., 2006).
Considerations for an AA-MAS
Page 104
Duncan and Hmelo-Silver (2009) have provided a description of ―essential features‖ of
learning progressions that attempts to capture an emerging consensus derived from panel
discussions organized by the Center on Continuous Instructional Improvement and the
Consortium for Policy Research in Education (Corcoran, Mosher, & Rogat, 2009). As described
by Duncan and Hmelo-Silver (2009), there are four essential features that define something as
a learning progression. First, learning progressions are focused on a few foundational and
generative disciplinary ideas and practices. Several researchers have argued that it is the
combined focus on content and practice which is unique to the current definition of learning
progressions and central to the development of scientific literacy (Smith et al., 2006). Second,
these progressions are bounded by an upper anchor describing what students are expected to
know and be able to do by the end of the progression and by a lower anchor describing
assumptions about the prior knowledge and skills of learners as they enter the progression. The
upper anchor is informed by analyses of the domain as well as societal expectations. Third,
they describe varying levels of achievements as the intermediate steps between the two
anchors. These levels are derived from syntheses of existing research on student learning in the
domain as well as empirical studies of the progression (such as cross sectional studies and
teaching experiments). Levels of achievement are provided in the form of learning performances
that can serve as evidence of students‘ level of understanding and competency. Fourth, learning
progressions are mediated by targeted instruction and curriculum. They are not developmentally
inevitable and as such do not describe learning as it naturally develops in the absence of
scaffolded curriculum and instruction.
Below we consider some of what we currently know about the components of
competence and the progression of learning in the domains of mathematics and reading. We
are not offering these descriptions as learning progressions that meet the criteria outlined
above, but as illustrations of what we do know about how knowledge and competence develops
over time and with instruction for certain aspects of the domains of reading and mathematics. In
Considerations for an AA-MAS
Page 105
considering the information that is provided about the sequence of learning and cognitive
development we must remind ourselves that two pertinent questions, which we need to answer
empirically, are whether low-performing students can be characterized as simply lagging behind
in the pace of their development and whether they follow the same or different progressions.
Clearly, being able to answer such questions is essential to the process of better educating
these students as well as for providing valid and fair assessments that are tied to appropriately
modified achievement standards that in turn have coherence within and across grade levels.
K-8 Reading
There is an unusual degree of consensus regarding the goals of early reading
instruction. The consensus is captured in the National Research Council Report, Preventing
Reading Difficulties in Young Children (NRC, 1998), and in the report of the National Reading
Panel, Teaching Children to Read (NICHD, 2000). The goals are often expressed in terms of
the competencies children should be able to demonstrate at the end of grade three: (a) read
age-appropriate literature independently with pleasure and interest; (b) read age-appropriate
explanatory texts with comprehension for the purpose of learning; and (c) talk and write about
those texts in age-appropriate ways. Achieving these goals requires simultaneous development
of an interdependent set of abilities: decoding skills, reading fluency, oral language
development, vocabulary development, comprehension skills, and the ability to encode speech
into writing.
The foundation for early reading lies in the earlier, informal acquisition of language. With
little effort, children with intact neurological systems acquire the sounds of their language, its
vocabulary, and its methods of conveying meaning (NRC, 1998). The path that children travel in
acquiring language is predictable (NRC, 1998), though the age at which particular skills and
abilities are mastered varies somewhat. As proficiency with language use grows, children
develop the ability to think about language. Before that ability develops, children do not
Considerations for an AA-MAS
Page 106
distinguish between the word and the object to which it refers. Children can begin to develop
rudimentary metalinguistic skills as early as age three. Acquiring this ability allows children to
play with, analyze, and pass judgment on the correctness of language.
The trajectory of language development as described above is universal, though the
richness of the environment affects the pace and extent of language development powerfully
(Hart & Risley, 1995; Huttenlocher, 1998). For example, Graves and Slater (1987) found that
first-graders from higher-income families had a vocabulary that was double the size of those
from low-income families. The differences are highly relevant because verbal ability generally,
and vocabulary development particularly, are good predictors of success in early reading.
While typical language development supports reading acquisition, other abilities required
for effective reading mastery are unlikely to develop unless children receive formal instruction.
With few exceptions, children need systematic instruction in the alphabetic principle to learn to
decode words, and to learn how to encode words in writing (Adams et al., 1998). This
instruction is what is referred to as ―phonics.‖ But successful phonics instruction rests on a more
fundamental ability: phonemic awareness. This is the awareness, for example, that the word
―cat‖ consists of three separable sounds — c/ a/ t. The distinction is important because phonics
instruction that teaches the mapping of separate sounds onto letters requires for success that a
student hear those separate sounds.
Learning the alphabetic principle is prerequisite to reading. However it is not nearly
sufficient to help children reach the desired third-grade competencies. Phonics instruction must
be integrated with comprehension instruction, opportunities to develop fluency in reading
through practice, instruction to enhance and practice oral and written language abilities, and
opportunities to acquire rich vocabulary and background knowledge. The failure of any one of
these will result in falling short of the third-grade goals. If fluency does not develop, little
meaning is taken from a text that a student must plod through. If background knowledge is
Considerations for an AA-MAS
Page 107
inadequate, even a fluent reader will be unable to engage with and learn from the text. The
components of successful reading are tightly intertwined.
In addition to building vocabulary, oral language instruction can extend a child‘s ability to
understand and use academic, or literate, language. This is the decontextualized language that
minimizes contextual cues and shared assumptions (e.g., by explicitly encoding referents for
pronouns, actions, and locations (Olson, 1977). These extensions of discourse into the
decontextualized register of academic language are what predict literacy success into middle
school, controlling for home variables (Dickinson & Sprague, 2001). These relationships
between preschool oral language and middle-school reading comprehension are clearly
mediated by decoding instruction in the primary grades (Whitehurst & Lonigan, 2001). But the
point is that language intervention that builds vocabulary and decontextualized language
structures needs to occur prior to and during decoding instruction, rather than later.
Writing is at the heart of mastering the alphabetic system. Writing starts with the
encoding of speech to print. The ability to phonemically segment sounds in speech and
represent them in conventional writing develops over time. A complete representation of a
word‘s spelling in memory developed through writing will enhance the speed and accuracy with
which it is recognized (Ehri, 1998; Perfetti, 1992). Thus, the writing of words supports the
reading of words and, over time, builds toward the writing of text, which can support the
comprehension of text.
In addition to understanding the contributors to successful reading acquisition, there is
also an extensive research base on the typical hurdles that children encounter (NRC, 1998;
NICHD, 2000). It is now well established that a significant number of children have difficulty
learning the alphabetic principle because they have not developed phonemic awareness.
Among children who learn to decode words but do not comprehend well, fluency is often the
culprit; if children struggle slowly through a text, their comprehension when they have finished
will be poor. Fluency can suffer if children spend too little time actively engaged in effective
Considerations for an AA-MAS
Page 108
reading practice, or if vocabulary and background knowledge are too weak to allow the student
to read with understanding (Lesgold & Perfetti, 1981).
In contrast to the above, in the area of reading comprehension much remains to be
known as reflected in an assessment of research needs by the RAND Reading Study Group
(RRSG; RAND, 2002a), as well as in the report of the National Reading Panel (NICHD, 2000).
Those reports make clear that with regard to both student learning and teacher preparation, the
research base to support practice is weak.
What Should Children Know and Be Able To Do? The answer to this question is
sometimes given in terms of state or national standards for reading and language arts but such
answers are often inadequate when it comes to development over time (see the discussion of
standards for reading in Pugalee and Rickelman, chapter 5, this volume). An answer to this
question is implied by the RRSG in its definition of reading comprehension as ―the process of
simultaneously extracting and constructing meaning through interaction and involvement with
written language‖ (RAND, 2002a). To extract meaning requires the reader to decode the words
and form a mental representation of what the text actually says, at both a local (sentences,
phrases, and their interconnections) and global level (the ―gist‖ of the text‘s meaning). To
construct meaning requires that the reader create a ―situation model,‖ or an understanding of
the intended meaning conveyed with these words that is informed not just by the text, but by the
knowledge and experience that the reader brings (Kintsch, 1998). The situation model is the
foundation from which inferences are drawn. Consider the sentence, ―The sky was a clear,
bright blue the day she first saw Charles.‖ The sentence does not state that it is not raining, but
the reader can infer this from the bright blue sky. More importantly, it says nothing about whom
Charles might be to the referenced woman, but we infer that he will be significant and
memorable—not a plumber who will fix her drain then disappear.
We would be pleased if a six-year-old student could read the above sentence, and
understand it semantically. But we would expect a 16-year-old student to develop a situation
Considerations for an AA-MAS
Page 109
model that is more complex due to greater developmental maturity, more experience with texts
and text genres, and the benefits of instruction. The high school student might appreciate the
expectation created by the author with two very simple phrases, and might productively reflect
on how that expectation might change if the sky were dark and the wind threatened to carry
away all in its path. And yet our understanding of the typical progression of student reading
comprehension between ages 6 and 16 is poorly mapped, with a consequence that our
instructional support for comprehension is poorly defined as well. As the RRSG argues, ―without
research-based benchmarks defining adequate progress in comprehension, we as a society risk
aiming far too low in our expectations for student learning.‖
Research in this area has far to go. Many research perspectives offer relevant insights
(e.g., Graesser, Mills, & Zwaan, 1997; Pearson & Hamm, 2002), but as yet there are no
integrated theories and companion models that provide a foundation for accumulating
knowledge and guiding instruction. Moreover, mapping progress in reading comprehension
requires that the phenomenon can be measured. Here again the knowledge base is weak.
Worse, what we do know suggests that existing, commonly used measures of comprehension
can be misleading. They capture meaning extraction and short-term memory, but these are not
good predictors of meaning construction. Interventions that can improve short term recall can
actually weaken inferencing capacity (Mannes and Kintsch, 1987). Both the mapping of
progress in reading comprehension and the evaluation of instructional interventions to improve
reading comprehension (e.g., Beck & McKeown, 2001; Beck et al., 1997; Pressley et al., 1989)
depend on the development of assessments that can measure all its aspects, including the
quality of the situation model.
K-8 Mathematics
Investment in recent decades by federal agencies and private foundations has produced
a wealth of knowledge about the development of mathematical understanding, and
Considerations for an AA-MAS
Page 110
correspondingly has led to the development of curricula that incorporate such knowledge (e.g.,
Carpenter, Fennema, & Franke, 1996; Ginsburg, Greenes, & Balfanz, 2003; Griffin, Case, &
Siegler, 1994). Much of contemporary research and theory is synthesized in a report on
elementary mathematics (NRC, 2001b), and in the work of a RAND study group that produced a
mathematics research agenda (RAND, 2002b). The National Research Council 2001 report
presents a view of what elementary school children should know and be able to do in
mathematics that draws on a solid research base in cognitive psychology and mathematics
education, some of which is described below. It includes mastery of procedures as a critical
element of mathematics competence, but places far more emphasis on understanding when
and how to apply those procedures than is common in many mathematics classrooms. The
latter is rooted in a deeper understanding of mathematical concepts, and a facility with
mathematical reasoning.
The NRC committee summarized its view in five intertwining ―strands‖ that constitute
mathematical proficiency:
Conceptual understanding—comprehension of mathematical concepts, operations, and
relations;
Procedural fluency—skill in carrying out procedures flexibly, accurately, efficiently, and
appropriately;
Strategic competence—ability to formulate, represent, and solve mathematical
problems;
Adaptive reasoning—capacity for logical thought, reflection, explanation, and
justification;
Productive disposition—habitual inclination to see mathematics as sensible, useful, and
worthwhile, coupled with a belief in diligence and one‘s own efficacy (NRC, 2001b).
Considerations for an AA-MAS
Page 111
Pugalee and Rickelman (Chapter 5 this volume) provide an excellent discussion of the
mathematics content and process strands that have been articulated in the NCTM standards
(NCTM, 2000) and that have in turn served as the basis for NAEP and state assessments.
Much of that discussion aligns with aspects of the NRC‘s five areas of mathematical proficiency
mentioned above. It is far beyond the scope of this chapter to try to capture what is known
empirically about the multiple aspects of mathematical proficiency, including their development
as a consequence of instruction. The literature on mathematical cognition and its development
covers a diversity of topics, ranging from geometry problem solving to infant perception of
numerosity (e.g., Greeno, 1978; Starkey & Cooper, 1980). However, it may be useful to
consider some of what is known about even the most basic aspects of mathematical knowledge
and competence. Accordingly, we have limited the discussion to current cognitive science
accounts of performance on relatively basic aspects of mathematics, those that figure
prominently in the early elementary school curriculum (see also Kalchman, Moss, & Case, 2001;
Lesh & Landau, 1983; Schoenfeld, 1985).
The discussion that follows considers in some detail what we know about the basics of
addition and subtraction, including computational procedures. The goal of doing so is to help
those outside the research arena understand that even the "simplest" cognitive acts and
instructional domains imply complicated forms of knowledge that are slowly acquired through
experience and instruction. Furthermore, just as knowledge is not random, neither is
performance, especially erroneous performance. This section concludes with a consideration of
the potential value of all this detailed information for instruction and assessment. The reader
may actually want to skim that concluding material before delving into what comes next.
Basic Addition. For many basic mathematics skills, expertise is necessarily defined in
terms of the knowledge, processing activities, and performance of adults. Thus, to begin a
discussion of cognitive analyses of basic mathematics we need to focus on theories of how
adults do mental addition when faced with problems containing addends from 0–9 (e.g.,
Considerations for an AA-MAS
Page 112
Ashcraft, 1982; 1983, 1985, 1987). The theory assumes that adults have two basic types of
mathematical knowledge. One type is an interrelated knowledge network containing the basic
addition facts. As described earlier, such knowledge is referred to as declarative knowledge,
i.e., knowledge of things that are true or false such as 2+3=5. The facts stored in this network
have different strengths that determine how long it takes to activate a piece of information.
Thus, if the fact 2+3=5 has greater associative strength than the fact 7+5=12, it will take less
time to retrieve (activate) the answer to the first of these two problems. The theory also
assumes the existence of a second type of knowledge, specifically, methods that can be used to
derive answers for problems lacking prestored answers, e.g., 14 x 36 vs. 4 x 6. As described
earlier, this is referred to as procedural knowledge, i.e., knowledge of "how to" do something.
For single digit addition it might include procedures such as counting on from one of the
addends an amount equal to the other addend. Adults actually have a variety of procedures for
calculating answers, including shortcuts that make use of stored facts. An example is computing
the answer to 28+25 by retrieving the sum of 25+25 and then adding 3 to 50.
This theory may seem to be nothing more than a restatement of what is intuitively
obvious to any adult. For most of us, the "process" of adding single digit numbers is essentially
the automatic retrieval of specific facts from memory. This process is rapid, automatic,
effortless, and largely error free. What is less obvious is that such a theory of stored knowledge
and retrieval processes provides the basis for explaining several phenomena observed in
adults' time to produce or verify basic addition facts. One phenomenon is that adults produce
answers very quickly, typically in less than a second (e.g., Ashcraft, 1985; Groen & Parkman,
1972). This can be attributed to the process of activating stored knowledge, a relatively rapid
and automatic process, as opposed to computing answers by way of sequential procedures, a
relatively slow and controlled process.
A second phenomenon is that the time to produce an answer systematically varies
across problems. The slowest responses are for problems with "large" sums such as 9+8, with
Considerations for an AA-MAS
Page 113
intermediate times for problems with medium sums such as 4+7, and relatively fast responses
for problems with small sums such as 2+1, 3+2 and for ―ties‖ such as 4+4, 7+7, etc., and these
problems are relatively homogeneous in time to respond (Ashcraft & Battaglia, 1978; Ashcraft &
Stazyk, 1981; Groen & Parkman, 1972). As noted earlier, such differences in retrieval time are
attributed to differences in the strength of specific facts. Stronger associations in the knowledge
network are faster to activate.
A third phenomenon is that the time to reject a fact such as 4+3=12 is substantially
slower than the time for 4+3=10, even though the first "answer" is actually further from the
correct answer (Winkelman & Schmidt, 1974). Such effects are attributed to associative
confusions between addition and multiplication facts. (See Ashcraft, 1982; 1985 for a more
comprehensive summary of basic results in mental addition and multiplication.)
The aforementioned theory of expert solution of simple addition problems relies heavily
on the assumption of differential associative strengths across the "basic facts" formed by the
digits 0–9. An obvious question is whether this assumption is arbitrary or whether the assumed
pattern of strength differences can be related to experiential phenomena. According to the law
of frequency, items accrue strength through use and practice. Analyses of problem presentation
frequency in children's mathematics texts indicate that those "basic facts" assumed to be
stronger in the network actually appear more frequently in the texts (Ashcraft, 1985).
Furthermore, analyses of multicolumn addition reveal that the frequency of adding 1,2, or 3 is
greater than that of adding 7,8, or 9, consistent with strength patterns in the network.
Given that this theory is a plausible account of adult or expert performance, the question
of developmental and instructional import concerns the nature of the progression from novice to
expert. The acquisition of expertise in addition actually has its roots in the more general domain
of number knowledge and quantitative understanding, acquisitions that are strongly tied to
children's counting behavior (e.g., Gelman & Gallistel, 1978; Steffe, von Glaserfeld, Richards &
Cobb, 1983). Prior to school entry most children have acquired relatively sophisticated counting
Considerations for an AA-MAS
Page 114
sequences for the digits 1-20 (Fuson & Hall, 1983; Gelman & Gallistel, 1978). Children also
have a basic understanding of the "semantics" of addition and subtraction in terms of the
combining and separating of quantities (e.g., Carpenter, 1985; Resnick, 1982, 1984). Their
understanding of addition, in concert with their knowledge of counting, permits the solution of
addition problems even in the absence of directly stored facts (e.g., Starkey & Gelman, 1982).
Substantial evidence now exists that initial knowledge of addition consists of procedures for
representing, combining and counting physical entities. Subsequently, addition can be
performed as mental counting operations in the absence of physical objects. Such overt and
covert operations constitute forms of procedural knowledge and processing that develop prior to
and along with declarative knowledge and direct retrieval of addition facts (Fuson, 1982).
Evidence for an addition acquisition sequence of the type described above is of several
types. First, young children with primitive counting skills often cannot solve simple addition
problems if the objects representing one of the addends are hidden (Steffe, Thompson &
Richards, 1982). Second, children are often observed counting fingers when solving addition
problems (Fuson, 1982). Third, the counting procedures used by children transition from
counting up to the cardinal value of the first addend and then counting on an amount equal to
the second addend, to simply counting on from the first addend (Carpenter, 1985; Fuson, 1982;
Houlihan & Ginsburg, 1981). Fourth, the time to do addition problems is closely related to
counting rates for young children but not for older children (Ashcraft, Fierman, & Bartolotta,
1984). Fifth, systematic differences in the time to answer problems are consistent with models
that minimize the number of counts, i.e., use of a procedure of counting on from the larger
addend (Groen & Parkman, 1972; Svenson, 1975). Sixth, even for young children, there are
some "facts" that are directly retrieved such as ties and small sums (Groen & Parkman, 1972;
Hamann & Ashcraft, 1985; Siegler & Shrager, 1984).
A developmental theory of the acquisition of expertise in addition includes specific
assumptions about the state of both declarative and procedural knowledge at different points in
Considerations for an AA-MAS
Page 115
time. It includes the assumption that there is a gradual acquisition and strengthening of the
network structure of addition facts. There is also a gradual acquisition of counting procedures
that permit the calculation of answers when "facts" are not of sufficient strength to be retrieved.
Preschoolers primarily depend on overt counting procedures to solve addition problems (Siegler
& Shrager, 1984). Given instruction and practice in the early grades, there is a transition to more
sophisticated and efficient counting procedures together with a transition from calculation via
counting to direct retrieval. Thus, at any point in time from preschool age through at least fourth
grade, a child will have some facts that can be retrieved and some that need to be calculated.
From the fourth grade on through adulthood, simple addition problems are solved via retrieval
with a continued strengthening of facts in the network resulting in further increases in the speed
of retrieving all addition facts (Ashcraft, 1985).
Subtraction. This discussion has concentrated on addition but the issues raised about
the nature of expertise and its acquisition are equally applicable to simple subtraction problems.
One can posit exactly the same type of theory of expertise for subtraction, with a network of
stored facts of varying strength and a set of procedures for calculation in the absence of directly
retrievable information. It is also reasonable to assume that subtraction facts vary in strength
(speed of retrieval) although far less is known about the details of such differences and whether
they parallel the results for addition.
With regard to procedural knowledge and the acquisition of expertise, there is ample
evidence that preschoolers and children in the early primary grades solve subtraction problems
by counting procedures, both overt and covert (Fuson, 1984; Svenson & Hedenborg, 1979;
Woods, Resnick & Groen, 1975). Considerable research has been done on the use of different
counting procedures to solve subtraction problems and the difficulties children sometimes
experience in understanding and using such procedures (Fuson, 1984). One is a decrementing
procedure in which the child counts down from the larger number (e.g. 9) an amount equal to
the smaller number (e.g., 2). Another is an incrementing procedure in which the child counts on
Considerations for an AA-MAS
Page 116
from the smaller number (e.g., 7) until the larger number (e.g., 9) is reached. These procedures
not only differ in ease of use but also in efficiency depending on problem characteristics. A
decrementing procedure is more efficient when there is a large difference between the numbers
(e.g., 9-2), while the converse is true for the incrementing procedure (e.g., 9-7). There is some
evidence that older children select the optimal counting procedure given such differences in
problem characteristics (Svenson & Hedenborg, 1979; Woods et. al., 1975).
A theory of expertise in subtraction and its acquisition is similar to the theory for addition.
Both emphasize the gradual acquisition of declarative knowledge facts. These changes in
knowledge and processing occur over a period of several years. The rate of change both within
and between individuals will vary with the experiential history and learning rate of each person.
Thus, one must consider the possibility that the difficulties in mathematics manifest by some
children are partially attributable to problems with basic facts. The facts may be sufficiently weak
such that they cannot be retrieved and must therefore be computed, and the counting
procedures for doing such computations may be slow and error prone.
Data on basic addition and subtraction performance suggest that children with
mathematics difficulties often must compute rather than directly retrieve answers to problems
(e.g., Connor, 1983; Goldman, Mertz, & Pellegrino, 1988; Russell & Ginsburg, 1984). Connor
(1983) has reported results obtained by Fleishner and her colleagues from testing basic facts.
Learning disabled students relied more on reconstructive counting strategies than the
nondisabled students who tended to rely on direct retrieval. This agrees with the results
obtained by Russell & Ginsburg (1984) who compared a group of math-disabled fourth graders
to nondisabled third and fourth graders. They observed particular difficulties in retrieving
addition facts by math-disabled students, with the children performing at a level below the
nondisabled third graders. Svenson and Broquist (1975) have also reported results indicating
that fifth-grade children with low mathematics achievement are particularly slow at answering
simple addition problems. Although available data are suggestive of difficulties in simple
Considerations for an AA-MAS
Page 117
addition and subtraction, considerably more must be done to pursue these issues. The theory of
expertise and its acquisition that has been outlined above provides a framework for
systematically pursuing issues regarding both the assessment and instruction of basic skills
(see also Baroody, Bajwa, & Eiland, 2009).
Mathematical Procedures: Subtraction. Knowledge and performance in basic skills are
particularly important when we consider more complex mathematical procedures that require
facility in such skills. For example, the typical course of instruction is to progress from single
column addition and subtraction problems to multicolumn problems of increasing difficulty. The
ultimate objective is knowledge of complex procedures such that the individual can solve any
addition or subtraction problem of any length. What do individuals know and do when they are
"experts" in multicolumn addition or subtraction? There are now explicit theories of the
knowledge underlying such complex skills, with primary attention given to subtraction (e.g.,
Brown & Burton, 1978; Young & O'Shea, 1981). Part of the emphasis on subtraction is
attributable to the difficulties children often have in solving subtraction problems with borrowing,
especially "borrowing from zero."
Knowledge of subtraction can be conceptualized as a complex procedure with multiple
parts, each of which represents a successive complication. The essential parts are (1)
processing single columns in a right to left order, (2) borrowing when the bottom digit in a
column is greater than the top digit, and (3) borrowing from zero. These three parts correspond
to the typical sequence in learning how to subtract. The child first learns how to subtract a single
column of numbers where the top number is always greater than the bottom number. Then this
is expanded to multiple columns but in problems where borrowing is never needed. The
assumption is frequently made that the child subtracts two numbers in a column by retrieving a
"fact" from memory such as 7-5=2. However, a child might actually perform the subtraction for
single digits by a counting procedure. The next major stage is to introduce the borrowing part of
the procedure. This involves a test to see if the top number is greater than the bottom number in
Considerations for an AA-MAS
Page 118
a column. If it is, then borrowing is needed and the sequence of steps is taught. In beginning
instruction this usually takes the form of crossing out the top digit in the column to the left,
decrementing it by one then writing the new digit in the top of that column. The child then writes
a 1 in front of the top digit in the original column and now goes on to do the column subtraction
by retrieving a fact such as 17-9=8. Practice in borrowing is provided with a progression to
problems with multiple columns that require borrowing. The final stage of instruction is the
procedure for borrowing from zero. The original borrowing procedure is now expanded to
include a test for whether the column to the left contains a zero. If a zero is present then a new
set of operations must be executed which include changing zero to 9 and moving one column to
the left, testing for zero again etc.
The preceding is a superficial description of the overall procedure for doing multicolumn
subtraction, its separate subprocedures and the general sequence for acquiring the
subprocedures. Adults typically have procedural knowledge of subtraction as well as declarative
knowledge of the meaning (semantics) of individual actions such as borrowing or borrowing
from zero relative to the base ten system. It is not clear, however, whether children comprehend
the meaning of the procedures taught to them. Analyses of children's errors in subtraction
suggest that they often follow faulty procedures that preserve "syntactic" aspects of subtraction
procedures such as crossing things out or writing down a 1 while simultaneously violating the
semantics of the procedures (see e.g., Resnick, 1982,1984).
Expertise can be defined as being able to solve any subtraction problem, which
minimally implies knowledge of all the elements of the subtraction procedure. Lower levels of
expertise are defined by the probability that errors will occur. Errors in subtraction can imply (a)
lapses of attention or memory, what have been termed slips (Norman, 1981), (b) the absence of
a procedure or a step in a procedure, or (c) incorrect representation of a procedure or a step in
a procedure. If errors are due to lapses of attention or memory failure such as retrieving 2 for 96, then there should be no pattern to the errors made by the child. However, if a child is lacking
Considerations for an AA-MAS
Page 119
knowledge or has incorrect knowledge of a procedure then systematic error patterns should be
observed within a child. To the extent that many children experience similar difficulty in
acquiring and/or representing complex procedures, then one would expect to find consistent
error patterns across children.
Considerable effort has been expended on analyzing children's errors on subtraction
problems (Burton, 1981; Brown & Burton, 1978; Brown & VanLehn, 1980; Friend & Burton,
1981; VanLehn, 1983, 1990; Young & O'Shea, 1981). It is now apparent that errors are not just
random, i.e., they cannot be attributed primarily to slips. Instead, errors tend to be systematic
and the systematicity can be directly related to one or more of the elements of the major
subprocedures of the complete subtraction procedure. As might be suspected, most of the
systematic errors involve borrowing in general and borrowing from zero in particular. A common
error is "smaller from larger" in which the child subtracts the smaller digit in a column from the
larger regardless of which one is on top. This may be due to a child's lack of knowledge about
how to borrow, a failure to incorporate a test for borrowing, or a carryover from simple
subtraction where the smaller number is always "taken away" from the larger number and
position doesn't matter.
Many common errors involve borrowing from zero. An example is changing zero to 9 but
failing to decrement the column to the left of zero. A different type of error is borrowing across
zero such that the column to the left of the zero is decremented by one but the zero is left
unchanged. One final example involving borrowing from zero is to stop the borrowing process at
zero. In this case the child correctly adds ten to the column where the top digit is less than the
bottom digit but fails to make any change in either the column to the left containing zero or the
column to the left of the zero. Another major set of errors involves the process of subtracting
from zero within a column. In these cases the child fails to use any borrowing procedure and
instead writes 0 or N for the column 0-N. For a more complete discussion of the most frequently
Considerations for an AA-MAS
Page 120
occurring errors in children's subtraction see Brown and Burton (1978), VanLehn (1983, 1990),
and Young and O'Shea (1981).
One way to conceptualize the underlying source of these types of errors is in terms of
slightly flawed procedural knowledge. The child has represented the procedures for performing
subtraction but one or more of the elements is incorrectly represented, i.e., the child has a "bug"
in his program for doing subtraction. The term bug is taken from computer programming and
reflects an algorithm that contains an incorrect operation. A systematic error is produced each
time the program is run on the particular class of problems that requires execution of that
operation. An alternative possibility is that the child is missing a piece of procedural knowledge,
which is similar to a critical operation being omitted from a program. In a computer program, a
missing operation will typically cause the program to "crash" and produce no output whatsoever.
However, in the case of a child who knows that some response must be made, the child
reaches an impasse. In order to move on the child attempts to repair that impasse by doing
something. The something he or she does is an operation that may mimic syntactic but not
semantic constraints of subtraction.
Given that children's errors in subtraction reflect slips, bugs, and impasses (VanLehn,
1983, 1990), there are several issues with respect to the applicability of such a theory of
knowledge and performance. One issue is a diagnosis of a child's problem. It is a nontrivial
exercise to develop tests capable of isolating the many different types of procedural bugs and
impasses that can occur, often in peculiar interaction, as well as a scoring procedure to do the
diagnosis (Burton, 1981). Furthermore, multiple samples of performance are needed to
determine if there is a stable pattern of bugs and/or impasses (see VanLehn, 1983). There are,
however, some systematic efforts in this direction using instructional materials and computerbased tests (VanLehn, 1983).
Other issues involve explaining the acquisition of flawed procedural knowledge and
developing instructional methods that minimize such outcomes. A missing procedure that gives
Considerations for an AA-MAS
Page 121
rise to an impasse in solution may result from a failure on the part of a student to represent a
specific operation. Thus, the child attempts to repair the overall subtraction procedure when an
impasse is actually reached in solving a problem (Brown & VanLehn, 1980). If these repairs are
practiced and fail to receive any corrective feedback they may become permanent bugs.
Another possibility is that a child initially misrepresents a procedure and then subsequently
practices that flawed procedure, again without corrective feedback. Thus, bugs can arise from
repairs to impasses, i.e., solution attempts for novel problems for which no procedure is
represented. They can also arise from incorrect initial representations of correct procedures. In
either case, the errors that children produce seem to follow many of the syntactic aspects of
subtraction (crossing things out, writing 1 in a column, etc.) while violating some of the
semantics of the procedures. Given this state of affairs, attempts have been made to investigate
instructional methods that link more closely the semantics and syntax of complex procedures
(Resnick, 1982, 1984). The hope is that such methods can minimize the development of flawed
procedural knowledge.
It is almost a given that elementary school children experiencing difficulty in mathematics
will demonstrate less than expert performance on problems requiring complex procedures.
Concern then is whether the errors they make can be understood in terms of the theory of
knowledge and performance described above. One possibility is that such children have all the
correct procedures and that errors are due to slips and miscalculations associated with their
weak "knowledge" of basic facts. This may be partially true (Russell & Ginsburg, 1984). A
second possibility is that parts of the procedural knowledge are either missing or flawed, in
which case the errors they make would be systematic. If there are systematic errors, then do
these children exhibit "bugs" similar to those found in previous research or are their errors more
bizarre? There is little in the way of systematic data to address these questions. Russell and
Ginsburg (1984) have reported limited data indicating that math-disabled fourth graders have
bugs similar to those exhibited by nondisabled, younger children. They offer a hypothesis of
Considerations for an AA-MAS
Page 122
"essential cognitive normality" in which math-disabled children are at the lower levels of
expertise representing the knowledge and performance of younger children. Considerably more
needs to be done to explore such a hypothesis as it applies to complex procedural skills, as well
as other important aspects of mathematical proficiency as identified by the National Research
Council (2001b).
Is All This Detail Necessary?
It is not uncommon for individuals to ask what useful purpose, beyond esoteric academic
pursuits, is served by the foregoing consideration of what we know about the knowledge and
cognitive processing underlying something as ―simple‖ as basic reading or mathematics
knowledge and skill? As mentioned earlier, the preceding was designed help those outside the
research arena understand that even the "simplest" cognitive acts and instructional domains
imply complicated forms of knowledge that are slowly acquired through experience and
instruction. Furthermore, just as knowledge is not random, neither is performance, especially
erroneous performance. In fact, some would argue that we can learn far more from mistakes
than we do from correct answers. Unfortunately, test content and test scores focus on just the
opposite. For one thing, test items are often far removed from a theory of the knowledge
underlying the performance of interest, and test scores provide little in the way of information
that is directly useful to teachers to guide instructional decision making. In a typical test, the
items are sampled from some universe of possibilities and the emphasis is not on the individual
problem but the score derived by aggregating over problems. This leads to a situation where the
same score can have very different meanings but there is no way of knowing that since the
focus is on the total score rather than the way in which the score was produced. If the research
within cognitive science has told us anything it is that the process by which a response is
produced is far more important than the product. The same products can often result from very
Considerations for an AA-MAS
Page 123
different thinking processes, and testing procedures are frequently insensitive to such
differences.
Consider, for example, a case where two children have systematic but different
misconceptions involving borrowing in multicolumn subtraction. They might well achieve the
same score by missing different problems. Even if they miss the same problem the nature of
their errors might be different. Typical tests and test scoring procedures do not discriminate
among these possibilities because they were not designed to do so nor do they provide any
information about the incorrect choices that were made. A similar situation could arise with
respect to tests of basic math facts. Tests of basic addition and subtraction facts are usually
timed. What matters is the number of correct answers within the time period allotted. What is
often ignored is how the number correct relates to the number attempted and the nature of the
errors made on those attempted. In this regard the author is reminded of an actual situation that
arose when one of his children brought home a test of addition and subtraction basic facts. All of
the addition facts were correct but almost all of the subtraction facts were wrong. The note on
the paper said that he should memorize his basic math facts. He was clearly distressed
because he didn't know what the teacher meant. I examined his test and noticed that for all the
subtraction fact answers that were incorrect they were off by 1. This suggested that he was not
recalling his facts from memory but was using a counting scheme that had a systematic flaw or
bug. I sat him down and got him to explain how he arrived at his answers and discovered that
he was using a ―counting down‖ procedure but with an extra count. I showed him how to correct
his ―buggy‖ procedure, he practiced the new one for a while until it was reliable and off he went
content that he wasn‘t stupid and that he could now get the right answers. It was true that he still
didn't "know" his subtraction facts but eventually he would because the counting procedure
would yield the right answers and this in turn would give way to retrieval from memory once
each of the facts was sufficiently strong to be associatively retrieved. The point of this little true
example is that tests and testing procedures need to be brought in correspondence with current
Considerations for an AA-MAS
Page 124
theories of the nature of expertise in the domain of interest and the nature of the acquisition
process. It is far more helpful to know that a child understands how to do subtraction and what it
means, albeit he is less than fluent in fact retrieval, than to know that he misses 70% of all his
subtraction facts.
There is an obvious challenge in translating theories about content knowledge and the
acquisition of expertise into acceptable and workable instructional and testing procedures. To
think that this is an easy task is to seriously underestimate the practical problems of the
translation and implementation process. On the one hand, researchers must be willing to
expend the time and effort needed to articulate their theories and assessment procedures in
ways that are operationally feasible. Assessment developers must be willing to adopt new
measurement models, scoring and reporting procedures. Educational practitioners must be
willing to articulate their needs regarding the instructional monitoring functions they would like to
perform and then find ways to incorporate new teaching and assessment technologies into daily
classroom practices.
Implications for Assessment Design
Existing guidelines for assessment design emphasize that the process of assessment
design should begin with a statement of the purpose for the assessment and a definition of the
content domain to be measured (AERA, APA, NCME, 1999). A central thesis of this chapter is
that the targets of inference should also be largely determined by a model of cognition and
learning that describes how people represent knowledge and develop competence in the
domain (the cognition element of the assessment triangle). Starting with a model of learning is
one of the main features that distinguishes the proposed approach to assessment design from
typical current approaches. The model suggests the most important aspects of student
achievement about which one would want to draw inferences, and provides clues about the
Considerations for an AA-MAS
Page 125
types of assessment tasks that will elicit evidence to support those inferences (see also NRC,
2001a; Pellegrino, 1988; Pellegrino, Baxter, & Glaser, 1999).
The model of learning that informs assessment design should have as many as possible
of the following key features. Ideally, it should:
1. Be based on empirical studies of learners in the domain.
2. Identify performances that differentiate beginning and expert performance in the domain.
3. Provide a developmental perspective, laying out typical progressions from novice levels
toward competence and then expertise, and noting landmark performances along the
way.
4. Allow for a variety of typical ways in which children come to understand the subject
matter.
5. Capture some, but not all, aspects of what is known about how students think and learn
in the domain. Starting with a theory of how people learn the subject matter, the
designers of an assessment will need to select a slice or subset of the larger theory as
the targets of inference.
6. Lend itself to being aggregated at different grain sizes so that it can be used for different
assessment purposes (e.g., to provide fine-grained diagnostic information as well as
coarser-grained summary information).
As described earlier, research on cognition and learning has produced a rich set of
descriptions of domain-specific performance that can serve as the basis for assessment design,
particularly for certain areas of reading, mathematics and science (e.g., NRC, 1998; 2000;
2001b; 2005, 2007; AAAS, 2001). Yet much more research is needed, especially with regard to
students who are low achievers and who may have various identifiable learning or cognitive
disabilities. This is despite the fact that a significant body of work already exists regarding
students with disabilities and their performance in aspects of mathematics (e.g., Baroody et al.,
2009; Fuchs, et al., 2005; Goldman, et al., 1988; Miller, 1997; Russell & Ginsburg, 1984;
Considerations for an AA-MAS
Page 126
Swanson & Jerman, 2006) and their performance in aspects of reading (e.g., Connor, 1983;
Fletcher et al., 2002; Foorman & Torgesen, 2001; O‘Connor & Jenkins, 1999; Torgesen, 2002;
Wagner et al., 1993; Wagner et al., 1997).
What follows are some of the implications of the knowledge we do have for multiple
aspects of assessment design and use. We begin with a consideration of issues related to
assessment purpose and move to implications of cognitive research and theory for assessment
that occurs in the context of the classroom and for state-level large-scale accountability
assessment. Many of the specific topics that are touched on subsequently related to
assessment design and use, including issues of validity and fairness, are developed in much
greater depth in the chapters that follow in Sections III and IV of this overall report.
Assessment Purposes, Levels & Timescales
Although assessments are currently used for many purposes in the educational system,
a premise of the Knowing What Students Know report (NRC 2001a) is that their effectiveness
and utility must ultimately be judged by the extent to which they promote student learning. The
aim of assessment should be ―to educate and improve student performance, not merely to audit
it‖ (Wiggins, 1998, p.7). Because assessments are developed for specific purposes, the nature
of their design is very much constrained by their intended use. The reciprocal relationship
between function and design leads to concerns about the inappropriate and ineffective use of
assessments for purposes beyond their original intent. To clarify some of these issues of
assessment purpose, design, and use it is worth considering two pervasive dichotomies in the
literature that are often misunderstood and conflated.
The first dichotomy is between ―internal‖ classroom assessments administered by
teachers, and ―external‖ tests administered by districts, states, or nations. Ruiz-Primo,
Shavelson, Hamilton, and Klein (2002) showed that these two very different types of
assessments are better understood as two points on a continuum that is defined by the
Considerations for an AA-MAS
Page 127
―distance‖ from the enactment of specific instructional activities. They defined five discrete
points on the continuum of assessment distance: immediate (e.g., observations or artifacts from
the enactment of a specific activity), close (e.g., embedded assessments and semi-formal
quizzes of learning from one or more activities), proximal (e.g., formal classroom exams of
learning from a specific curriculum), distal (e.g., criterion-referenced achievement tests such as
required by the U.S. No Child Left Behind legislation), and remote (broader outcomes measured
over time, including norm-referenced achievement tests and some national and international
achievement measures. Different assessments should be understood as different points on this
continuum if they are to be effectively aligned with each other and with curriculum and
instruction.
A second pervasive dichotomy is the one between ―formative‖ assessments used to
advance learning and ―summative‖ assessments used to provide evidence of prior learning.
Often it is assumed that classroom assessment is synonymous with formative assessment, and
that large-scale assessment is synonymous with summative assessment. What are now widely
understood as different types of assessment practices are more productively understood as
different functions of assessment practice, and that summative and formative functions can be
identified for most assessment activities regardless of the level on which they function.
Drawing from the work of Lemke (2001), it is apparent that different assessment
practices can be understood as operating at different timescales. The timescales for the five
levels defined above can be characterized as minutes, days, weeks, months, and years.
Timescale is important because the different competencies that various assessments aim to
measure (and therefore the appropriate timing for being impacted by feedback) are ―timescalespecific.‖ The cycles, or periodicity, of educational processes build from individual utterances
into an individual‘s lifespan of educational development. What teachers and students say in
class constitute verbal exchanges; these exchanges make up the lesson; a sequence of lessons
make up the unit; units form a curriculum, and the curricula form an education. Each of these
Considerations for an AA-MAS
Page 128
elements operates on different cycles or timescales: second to second, day to day, month to
month, and year to year.
The level at which an assessment is intended to function, which involves varying
distance in ―space and time‖ from the enactment of instruction and learning, has implications for
how and how well it can fulfill various functions of assessment, be they formative, summative, or
program evaluation (see NRC, 2003). As argued elsewhere (Hickey & Pellegrino, 2005;
Pellegrino & Hickey, 2006), it is also the case that the different levels and functions of
assessment can have varying degrees of match with theoretical stances about the nature of
knowing and learning. With this in mind we now turn to the implications of cognitive theory and
research for both classroom assessment practices and for large-scale assessment. These two
contexts reflect some of the rich variation in assessment captured by the foregoing discussion of
levels, functions, and timescales.
Implications of Cognitive Theory & Research for Classroom Assessment
Shepard (2000) discusses ways in which classroom assessment practices need to change to
better support learning: the content and character of assessments need to be significantly improved
to reflect contemporary understanding of learning; the gathering and use of assessment information and
insights must become a part of the ongoing learning process; and assessment must become a
central concern in methods courses in teacher preparation programs. Her messages are reflective of
a growing belief among many educational assessment experts that if assessment, curriculum, and
instruction were more integrally connected, student learning would improve (e.g., Pellegrino, Baxter,
& Glaser, 1999; Stiggins, 1997).
Sadler (1989) provides a conceptual framework that places classroom assessment in the
context of curriculum and instruction. According to this framework, three elements are required for
assessment to promote learning:
A clear view of the learning goals (derived from the curriculum)
Considerations for an AA-MAS
Page 129
Information about the present state of the learner (derived from assessment)
Action to close the gap (taken through instruction)
Furthermore, there are ongoing, dynamic relationships among formative assessment,
curriculum, and instruction. That is, there are important bidirectional interactions among the three
elements, such that each informs the other. For instance, formulating assessment procedures for
classroom use can spur a teacher to think more specifically about learning goals, thus leading to
modification of curriculum and instruction. These modifications can, in turn, lead to refined assessment
procedures, and so on. The mere existence of classroom assessment along the lines discussed here
will not ensure effective learning. The clarity and appropriateness of the curriculum goals, the
validity of the assessments in relationship to these goals, the interpretation of the assessment
evidence, and the relevance and quality of the instruction that ensues are all critical determinants of
the outcome. Starting with a model of cognition and learning in the domain can enhance each of
these determinants.
For most teachers, the ultimate goals for learning are established by the curriculum, which is
usually mandated externally (e.g., by state content standards). However, teachers and others
responsible for designing curriculum, instruction, and assessment must fashion intermediate goals
that can serve as an effective route to achieving the ultimate goals, and to do so effectively they
must have an understanding of how students represent knowledge and develop competence in the
domain. National and state content standards set forth learning goals, but often not at a level of
detail that is useful for operationalizing those goals in instruction and assessment. By dividing goal
descriptions into sets appropriate for different age and grade ranges, current content standards
provide broad guidance about the nature of the progression to be expected in various subject
domains. Whereas this kind of epistemological and conceptual analysis of the subject domain is an
essential basis for guiding assessment, deeper cognitive analysis of how people learn the subject
matter is also needed. Formative assessment should be based in cognitive theories about how
Considerations for an AA-MAS
Page 130
people learn particular subject matter to ensure that instruction centers on what is most important
for the next stage of learning, given a learner‘s current state of understanding.
It follows that teachers need training to develop their understanding of cognition and
learning in the domains they teach. Preservice and professional development are needed to
uncover teachers‘ existing understandings of how students learn and to help them formulate models
of learning so they can identify students‘ naive or initial sense-making strategies and build on those
to move students toward more sophisticated understandings. The aim is to increase teachers‘
diagnostic expertise so they can make informed decisions about next steps for student learning.
This has been a primary goal of cognitively based approaches to instruction and assessment that
have been shown to have a positive impact on student learning, including the Cognitively Guided
Instruction program (Carpenter, Fennema, and Franke, 1996) and others (Cobb et al., 1991; Griffin
& Case, 1997). Such approaches rest on a bedrock of informed professional practice.
Implications of Cognitive Research and Theory for Large-scale Assessment
Large-scale assessments are further removed from instruction but can still benefit learning if
well designed and properly used. Substantially more valid, useful, and fair information could be
gained from large-scale assessments if the principles of design set forth above and described
subsequently were applied. However, fully capitalizing on contemporary theory and research will require
more substantial changes in the way large-scale assessment is approached, and relaxation of some of
the constraints that currently drive large-scale assessment practices.
Large-scale summative assessments should focus on the most critical and central aspects
of learning in a domain as identified by content standards and informed by cognitive research and
theory. Large-scale assessments typically will reflect aspects of the model of learning at a less
detailed level than classroom assessments, which can go into more depth because they focus on a
smaller slice of curriculum and instruction. For instance, one might need to know for summative
purposes whether a student has mastered the more complex aspects of multicolumn subtraction,
Considerations for an AA-MAS
Page 131
including borrowing from and across zero, rather than exactly which subtraction bugs lead to
mistakes. At the same time, while policymakers and parents may not need all the diagnostic detail
that would be useful to a teacher and student during the course of instruction, large-scale
summative assessments should be based on a model of learning that is compatible with and
derived from the same set of knowledge and beliefs about learning as classroom assessment.
As described previously, research on cognition and learning suggests a broad range of
competencies that should be assessed when measuring student achievement, many of which are
essentially untapped by current assessments. Examples are knowledge organization, problem
representation, strategy use, metacognition, and participatory activities (e.g., formulating questions,
constructing and evaluating arguments, contributing to group problem-solving). Furthermore, largescale assessments should provide information about the nature of student understanding, rather
than simply ranking students according to general proficiency estimates.
Large-scale assessments not only serve as a means for reporting on student achievement,
but also reflect aspects of academic competence societies consider worthy of recognition and
reward. Thus, large-scale assessments can signal worthwhile targets for educators and students to
pursue. Whereas teaching directly to the items on a test is not desirable, teaching to the theory of
cognition and learning that underlies an assessment can provide positive direction for instruction.
A major problem is that only limited improvements in large-scale assessments are possible
under current constraints and typical standardized testing scenarios. Large-scale assessments are
designed to meet certain purposes under constraints that often include providing reliable and
comparable scores for individuals as well as groups; sampling a broad set of content standards
within a limited testing time per student; and offering cost-efficiency in terms of development,
scoring, and administration. To meet these kinds of demands, designers typically create
assessments that are given at a specified time, with all students being given the same (or parallel)
tests under strictly standardized conditions (often referred to as ―on-demand‖ assessment). Tasks
are generally of the kind that can be presented in paper-and-pencil format, that students can respond
Considerations for an AA-MAS
Page 132
to quickly, and that can be scored reliably and efficiently. In general, competencies that lend
themselves to being assessed in these ways are tapped, while aspects of learning that cannot be
observed under such constrained conditions are not addressed. To design new kinds of situations
for capturing the complexity of cognition and learning will require examining the assumptions and
values that currently drive assessment design choices and breaking out of the current paradigm to
explore alternative approaches to large-scale assessment.
The Design of Observational Situations
Once the purpose for an assessment, the underlying model of learning in the domain,
and the desired types of inferences to be drawn from the results have been specified, situations
must be designed for collecting evidence to support the desired inferences about what students
know and can do.
Task design. The focus should be on the cognitive demands of tasks (the mental
processes and knowledge required for successful performance), rather than primarily on surface
features, such as how tasks are presented to students or the format in which students are asked
to respond. For instance, it is commonly believed that multiple-choice items are limited to
assessing low-level processes such as recall of facts, whereas performance tasks elicit more
complex cognitive processes. However, research shows that the relationship between item
format and cognitive demands is not so straightforward (Baxter and Glaser, 1998; Hamilton,
Nussbaum, and Snow, 1997).
Linking tasks to the model of cognition and learning forces attention to a central principle
of task design—that tasks should emphasize the features that are relevant to the construct
being measured and minimize extraneous features (AERA, APA, NCME, 1999; Messick, 1993).
Ideally, a task will not measure aspects of cognition that are irrelevant to the targeted
performance. For instance, when assessing students‘ mathematical reasoning, one should
avoid presenting problems in contexts that might be unfamiliar to a particular population of
Considerations for an AA-MAS
Page 133
students. Similarly, mathematics tasks should not make heavy demands for reading or writing
unless one is explicitly aiming to assess students‘ abilities to read or communicate about
mathematics. Surface features of tasks do need to be considered to the extent that they affect
or change the cognitive demands of the tasks in unintended ways.
Task difficulty. The difficulty of tasks should be explained in terms of the underlying
knowledge and cognitive processes required, rather than simply in terms of statistical item
difficulty indices, such as the proportion of respondents answering the item correctly (which
ignores that two tasks with similar surface features can be equally difficult, but for very different
reasons). Beyond knowing that 80 percent of students answered a particular item incorrectly, it
would be educationally useful to know why so many did so, that is, to identify the sources of the
difficulty so they could be remedied.
Cognitive theory and analysis can be helpful here. For instance, cognitive research
shows that a mathematics word problem that describes the combining of quantities and seeks
the resultant total (e.g., John has 3 marbles and Mary has 5, how many do they have
altogether?) is easier to comprehend than one that describes the same actors but expresses a
comparison of their respective quantities (e.g., John has 3 marbles. He has 2 less than Mary.
How many does she have?) (see e.g., Morales, Shute & Pellegrino, 1985; Riley, Greeno, &
Heller, 1983). Part of the difficulty for children is the conflict between the relational expression
less than, which implies subtraction, and the operation required, which involves addition.
The point is not that such sources of difficulty should necessarily be avoided. Rather,
these kinds of cognitive complexities should be introduced into the assessment tasks in
principled ways in those cases in which one wants to draw inferences about whether students
can handle them. There are many reasons why educators might want to assess students‘
abilities to apply integrated sets of skills (e.g., literacy and mathematics capabilities) to complex
problems. That is entirely consistent with the approach being set forth here, as long as
assessment design begins with a model of learning that describes the complex of skills,
Considerations for an AA-MAS
Page 134
understandings, and communicative practices that one is interested in making inferences about,
and tasks are specifically designed to provide evidence to support those inferences.
Scoring. Tasks and the procedures to be used for drawing the relevant evidence from
students‘ responses to those tasks must be considered together. That is, the ways in which
student responses will be scored should be conceptualized during the design of a task. A task
may stimulate creative thinking or problem solving, but such rich information will be lost unless
the means used to interpret the responses capture the evidence needed to draw inferences
about those processes. Like tasks, scoring methods must be carefully constructed to be
sensitive to critical differences in levels and types of student understanding identified by the
model of learning. At times one may be interested in the quantity of facts a student has learned,
for instance, when one is measuring mastery of the alphabet or multiplication table. However, a
cognitive approach generally implies that when evaluating students‘ responses, the focus
should be on the quality or nature of their understanding, rather than simply the quantity of
information produced. In many cases, quality can be modeled quantitatively; that is, even in very
qualitative contexts, ideas of order and orderliness will be present.
Task sets and assembly of an assessment instrument. An assessment should be more
than a collection of items that work well individually. The utility of assessment information can
be enhanced by carefully selecting tasks and combining the information from those tasks to
provide evidence about the nature of student understanding. Sets of tasks should be carefully
constructed and selected to discriminate among different levels and kinds of understanding that
are identified in the model of learning. To illustrate this point simply, it takes more than one item
or a collection of unrelated items to diagnose a procedural error in subtraction. If a student
answers three of five separate subtraction questions incorrectly, one can infer only that the
student is using some faulty process(es), but a carefully crafted collection of items can be
designed to pinpoint the limited concepts or flawed rules the student is using.
Considerations for an AA-MAS
Page 135
Validation
Traditionally, validity concerns associated with achievement tests have tended to center
around test content, that is, the degree to which the test samples the subject matter domain
about which inferences are to be drawn. There is increasing recognition within the assessment
community that traditional forms of validation — which emphasize expert appraisal of the
alignment of tasks with content frameworks, and their statistical consistency with other
measures — should be supplemented with evidence of the cognitive or substantive aspect of
validity (e.g., AERA, APA, NCME, 1999; Messick, 1993). That is, the trustworthiness of the
interpretation of test scores should rest in part on empirical evidence that the assessment tasks
actually tap the intended cognitive processes and knowledge.
As described by Messick (1993) and summarized by Magone, Cai, Silver, and Wang
(1994), a variety of techniques can be used to examine the processes examinees use during
task performance to evaluate whether prospective items are functioning as intended. These
techniques include protocol analysis, in which students are asked to think aloud as they solve
problems or to describe retrospectively how they solved the problems. Another method is
analysis of reasons, in which students are asked to provide rationales for their responses to the
tasks. A third method is analysis of errors, in which one draws inferences about processes from
incorrect procedures, concepts, or representations of the problems.
Situative and sociocultural research on learning suggests that validation should be taken
a step further. This body of research emphasizes that cognitive processes are embedded in
social practices. From this perspective, performance of students on tests is understood as an
activity in the situation that the test presents and success depends on abilities to participate in
the practices of test taking (Greeno, Pearson, & Schoenfeld, 1996). It follows that validation
should include the collection of evidence that test-takers possess the communicative practices
required for their responses to be actual indicators of their abilities, for instance, to understand
Considerations for an AA-MAS
Page 136
and reason. This has been demonstrated to be false in many cases (e.g., Cole, Gay, and Glick,
1968).
Reporting
Although reporting of results occurs at the end of an assessment cycle, assessments
must be designed from the outset to ensure that reporting of the desired types of information will
be possible. The familiar distinction between norm-referenced and criterion-referenced testing
(Glaser, 1963) is salient in understanding the central role of a model of learning in the reporting
of assessment results. The notion of criterion-referenced testing has gained popularity in the
last several decades, particularly with the advent of standards-based reforms in the 1990s. As a
result of these reforms, many states are implementing tests designed to measure student
performance against standards in core content areas.
Because criterion-referenced interpretations depend so directly on a clear explication of
what students can or cannot do, well-delineated descriptions of learning in the domain are key
to their effectiveness in communicating about student performance. Test results should be
reported in relation to a model of learning. The ways people learn the subject matter and
different states of competence should be displayed and made as recognizable as possible to
educators, students, and the public to foster discussion and shared understanding of what
constitutes academic achievement.
Fairness
Fairness in testing is defined in many ways (see AERA, APA, NCME, 1999), but at its
core is the idea of comparable validity: a fair test is one that yields comparably valid inferences
from person to person and group to group (NRC, 1999d). An assessment task is considered
biased if construct-irrelevant characteristics of the task result in different meanings for different
subgroups. Currently, bias tends to be identified through expert review of items. Such a finding
is merely judgmental, however, and in and of itself may not warrant removal of items from an
Considerations for an AA-MAS
Page 137
assessment. Also used are statistical DIF (differential item functioning) analyses, which identify
items that produce differing results for members of particular groups after the groups have been
matched in ability with regard to the attribute being measured. However, DIF is a statistical
finding and again may not warrant removal of items from an assessment. Some researchers
have therefore begun to supplement existing bias-detection methods with cognitive analyses
designed to uncover the reasons why items are functioning differently across groups, in terms of
how students think about and approach the problems (e.g., Lane, Wang, and Magone, 1996;
Zwick and Ercikan, 1989).
A particular set of fairness issues involves the testing of students with disabilities. A
substantial number of children who participate in assessments do so with accommodations
intended to permit them to participate meaningfully. For instance, a student with a severe
reading and writing disability might be able to take a chemistry test with the assistance of a
computer-based reader and dictation system. Unfortunately, little evidence currently exists
about the effects of various accommodations on the inferences one might wish to draw about
the performance of individuals with disabilities (NRC, 1997), though some researchers have
taken initial steps in studying these issues (Abedi, Hofstetter, & Baker, 2001). Therefore,
cognitive analyses are also needed to gain insight into how accommodations affect task
demands, as well as the validity of inferences drawn from test scores obtained under such
circumstances.
Conclusions & Caveats
A major thesis of this chapter is that the task of developing assessments tied to modified
achievement standards needs to take seriously what we know about aspects of human
cognition and its development, especially in domains of academic achievement and
performance. This is much easier said than done, especially when we try to consider students
who are hard to define and classify (see Quenemoen, Chapter 2, this volume) and whose
Considerations for an AA-MAS
Page 138
performance in the regular education context and on general academic achievement tests
consistently leaves much to be desired. Many of the issues raised in this chapter pertain to the
assessment of any and all students. It would behoove us to pay as much attention to a careful
definition of academic achievement standards and assessments for the majority of students as
we might propose to give to a subgroup of the population for whom we wish to define ―modified
achievement standards‖ and develop appropriate assessments. While there is much we know
about aspects of the development of competence in the regular education population there is
much that we don‘t know about ―cognition‖ in selective parts of the school age population. This
knowledge gap has major implications for defining appropriate assessment targets and
appropriate modes of assessment.
Given that we can never really know all that we need to know, and that there are
pragmatic problems to be solved in the design of assessments for students with typical low
levels of achievement, it is perhaps useful to consider some of the modifications that have been
proposed for the assessment of this group of students and whether such design features can be
justified. One such example is reducing cognitive load in questions through various means such
as presenting a smaller number of distracters. While such a modification might well reduce
construct irrelevant variance associated with working memory or metacognitive monitoring
issues, it is a far cry from the type of deep engagement with assessment design issues related
to the measurement construct that one would like to see. Such modifications may in fact
change the nature of the construct assessed, but they do little to engage the issue of what the
targets for learning should be and how such targets are manifest in the types of knowledge
tapped by specific test items. Furthermore, work needs to be done to establish the cognitive
processing validity of such modifications, which includes ruling out the possibility that improved
performance has little to do with the nature of the knowledge and skill that is the intended
achievement target. This is not to say, however, that attention to issues of processing load and
Considerations for an AA-MAS
Page 139
―construct irrelevant variance‖ should not be considered in the design of assessment situations
for students who may experience attentional processing issues.
A second example is choosing items that have a lower level of cognitive difficulty or
―depth of knowledge‖ but that still represent the appropriate grade-level content standards.
Aside from the fact that item difficulty can be driven by multiple factors that vary in their
relationship to the construct one proposes to measure, it is important that any such efforts
grapple in some detail with precisely what is meant by depth of knowledge. This can only be
done by taking seriously an analysis of the nature of domain knowledge and competence and
then selecting items that are purposely designed to assess some restricted aspects of the
domain. Such a choice can be done in a principled way and it necessarily brings with it the
implication that the new assessment has a different construct representation than the general
education assessment. Regardless of how one chooses to approach the process of changing
difficulty, decisions can‘t be made on the basis of a superficial level of analysis of the nature of
knowledge and skill desired at a particular grade level nor a similarly superficial analysis of how
it articulates with some progression of knowledge and skill at both higher and lower grade
levels.
A third possible approach is using a ―dynamic assessment‖ procedure with varying levels
of prompting and scaffolding. Such a procedure might well be justifiable when assessment is
viewed as a test of transfer and the goal is to see how far a student is capable of transferring his
or her knowledge. As was noted above, one must still engage deeply with an analysis of what
defines competence and acceptable achievement at the particular grade level. On that basis
one can determine how a dynamic assessment or scaffolding process can be used to provide
an estimate of the intended construct at the desire level of attainment.
One way to view the preceding is as a plea for states, and the field in general, to move
cautiously in the development of modified achievement standards and in the implementation of
assessments designed to provide valid measures of those standards. Clearly, there is much
Considerations for an AA-MAS
Page 140
that we still need to know. A ―design-based‖ research strategy (e.g., Kelly, Lesh, & Baek, 2008),
in which there are serious, small-scale attempts to design and validate such assessments
through iterative cycles of empirical testing, redesign, and refinement, may be the most
appropriate and cost-effective model for the field at this point in time. Such an approach has
multiple advantages including avoiding considerable investment in an approach to assessment
that may have limited validity and utility. One critical goal of the pursuit of modified achievement
standards should be to provide resources and information that allow educators to engage in a
better integration of curriculum, instruction, and assessment for precisely those students whose
achievement is characteristically well below desired levels.
Considerations for an AA-MAS
Page 141
References
Abedi, J., Hofstetter, & Baker, E. (2001). NAEP math performance and test accommodations:
Interactions with student language background (CSE Technical Report 536). Los
Angeles: National Center for Research on Evaluation, Standards, and Student Testing
(CRESST), University of California, Los Angeles.
Adams, J., Treiman, R., & Pressley, M. (1998). Reading, writing, and literacy. In I.E. Sigel and
K.A. Renninger, (Eds.), Handbook of Child Psychology, Fifth Edition, Vol. 4: Child
Psychology in Practice (pp. 275-355). New York: Wiley.
American Association for the Advancement of Science. (2001). Atlas of science literacy.
Washington, DC: Author.
American Educational Research Association, American Psychological Association, & National
Council of Measurement in Education (1999). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
Anderson, J.R. (Ed.) (1981). Cognitive skills and their acquisition. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Ashcraft, M.H. (1982). The development of mental arithmetic: A chronometric approach.
Developmental Review, 2, 213-236.
Ashcraft, M.H. (1983). Simulating network retrieval of arithmetic facts (Report No. 1983/10).
Pittsburgh: University of Pittsburgh, Learning Research and Development Center.
Ashcraft, M.H. (1985). Children's knowledge of simple arithmetic: A developmental model and
simulation. Unpublished manuscript, Cleveland State University.
Ashcraft, M.H. (1987). Children's knowledge of simple arithmetic: A developmental simulation.
In J. Bisanz, C. Brainerd, & R. Kail (Eds.), Formal methods in developmental psychology
(pp. 302-338). New York: Springer-Verlag.
Ashcraft, M.H., & Battaglia, J. (1978). Cognitive arithmetic: Evidence for retrieval and decision
processes in mental addition. Journal of Experimental Psychology: Human Learning
and Memory, 4, 527-538.
Ashcraft, M.H., & Fierman, B.A., & Bartolotta, R. (1984). The production and verification tasks
in mental addition: An empirical comparison. Developmental Review, 4, 157-170.
Ashcraft, M.H., & Stazyk, E.H. (1981). Mental addition: A test of three verification models.
Memory & Cognition, 9, 185-196.
Baddeley, A. (1986). Working memory. Oxford: Oxford University Press.
Baroody, A.J., Bajwa, N.P., & Eiland, M. (2009). Why can't Johnny remember the basic facts?
Developmental Disabilities Research Reviews, 15(1), 69 – 79.
Baxter, G. P., & Glaser, R. (1998). Investigating the cognitive complexity of science
assessments. Educational Measurement: Research and Practice, 17(3), 37-45.
Considerations for an AA-MAS
Page 142
Beck, I.L., & McKeown, M.G. (2001). Text talk: Capturing the benefits of read-aloud experiences
for young children. The Reading Teacher, 55(1), 10-20.
Beck, I.L., McKeown, M.G., Hamilton, R.L., & Kucan, L. (1997). Questioning the author: An
approach for enhancing student engagement with text. Newark, DE: International
Reading Association.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education,
5(1), 7-73.
Brown, J.S., & Burton, R.R. (1978). Diagnostic models for procedural bugs in mathematics.
Cognitive Science, 4, 379-426.
Brown, J.S., & Van Lehn, K. (1980). Repair theory: A generative theory of bugs in procedural
skills. Cognitive Science, 4, 379-426.
Burton, R.B. (1981). DEBUGGY: Diagnosis of errors in basic mathematical skills. In D.H.
Sleeman & J.S. Brown (Eds.), Intelligent tutoring systems. London: Academic Press.
Carpenter, T.P. (1985). Learning to add and subtract: An exercise in problem solving. In E.A.
Silver (Ed.), Teaching and learning mathematical problem solving: Multiple research
perspectives. (pp. 17-40). Hillsdale, NJ: Lawrence Erlbaum Associates.
Carpenter, T.P., Moser, J.M., & Romberg, T.A. (Eds.). (1982). Addition and subtraction: A
cognitive perspective. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Carpenter, T., Fennema, E., & Franke, M. (1996). Cognitively guided instruction: A knowledge
base for reform in primary mathematics instruction. Elementary School Journal, 97(1), 320.
Chi, M.T.H., Feltovich, P. J., & Glaser, R. (1981). Categorization and representation of physics
problems by experts and novices. Cognitive Science, 5, 121-152.
Cobb, P., Wood, T., Yackel, E., Nicholls, J., Wheatley, G., Trigatti, B., & Perlwitz, M. (1991).
Assessment of a problem-centered second-grade mathematics project. Journal for
Research in Mathematics Education, 22(1), 3-29.
Cole, M., Gay, J., & Glick, J. (1968). Some experimental studies of Kpelle quantitative behavior.
Psychonomic Monograph Supplements, 2(10), 173-190.
Connor, F. P. (1983). Improving school instruction for learning disabled children: The Teachers
College Institute. Exceptional Education Quarterly, 4, 23-44.
Corcoran, T.B., Mosher, F.A. & Rogat, A. (2009). Learning progressions in science: An
evidence-based approach to reform. Consortium for Policy Research in Education,
Center on Continuous Instructional Improvement, Teachers College, Columbia
University, New York.
Dickinson, D.K., & Sprague, K. E. (2001) The nature and impact of early childhood care
environments on the language and early literacy development of children from low-
Considerations for an AA-MAS
Page 143
income families. In S.B. Neuman and D.K. Dickinson (Eds.), Handbook of early literacy
research (pp. 263-280). New York: Guilford Press.
Duncan, R.G., & Hmelo-Silver, C. (2009). Learning progressions: Aligning curriculum,
instruction, and assessment. Journal for Research in Science Teaching, in press.
Dweck, C., & Legget, E. (1988). A social-cognitive approach to motivation and personality.
Psychological Review, 95, 256-273.
Ehri, L.C. (1998) Grapheme-phoneme knowledge is essential for learning to read words in
English. In J.L. Metsala and L.C. Ehri, (Eds.), Word Recognition in Beginning Literacy
(pp. 3-40). Mahwah, NJ: Erlbaum.
Fay, A. & Klahr, D. (1996). Knowing about guessing and guessing about knowing:
Preschoolers' understanding of indeterminacy. Child Development, 67, 689-716.
Fletcher, J.M., Foorman, B. R., Boudousquie, A., Barnes, M., Schatschneider, C., & Francis,
D.J. (2002). Assessment of reading and learning disabilities: A research-based,
treatment-oriented approach. Journal of School Psychology, 40, 27-63.
Foorman, B.R., & Torgesen, J.K. (2001). Critical elements of classroom and small-group
instruction promote reading success in all children. Disabilities Research and Practice,
16, 202-211.
Fredriksen, N., Lesgold, A., Glaser., & Shafton, M. (Eds.) (1988). Diagnostic monitoring of skill
and knowledge. Hillsdale, NJ: Lawrence Erlbaum Associates.
Friend, J., & Burton, R. (1981). A teacher's manual of subtraction bugs (working paper). Palo
Alton, CA: Xerox Palo Alto Research Center.
Fuchs, L.S., Compton, D.L., Fuchs, D., Paulsen, K., Bryant, J.D., Hamlett, C.L. (2005). The
prevention, identification, and cognitive determinants of math difficulty. Journal of
Educational Psychology. 97(3), 493-513.
Fuson, K.C. (1982). An analysis of the counting-on procedure in addition. In T.P. Carpenter,
J.M. Moser, & T.A. Romberg (Eds.) Addition and subtraction: A cognitive perspective
(pp. 67-81). Hillsdale, NJ: Lawrence Erlbaum Associates.
Fuson, K.C. (1984). More complexities in subtraction. Journal for Research in Mathematics
Education, 15, 214-225.
Fuson, K.C., & Hall, J.W. (1983). The acquisition of early number word meanings: A
conceptual analysis and review. In H.P. Ginsburg, (Ed.), The development of
mathematical thinking (pp. 49-107). New York: Academic Press.
Geary, D. (1995). Reflections of evolution and culture in children's cognition: Implications for
mathematical development and instruction. American Psychologist, 50(1), 24-37.
Gelman, R. (1990). First principles organize attention to and learning about relevant data:
Number and the animate-inanimate distinction as examples. Cognitive Science, 14, 79106.
Considerations for an AA-MAS
Page 144
Gelman, R., & Gallistel, C.R. (1978). The child's understanding of number. Cambridge, MA:
Harvard University Press.
Ginsburg, H.P., Greenes, C., & Balfanz, R. (2003). Big math for little kids. Parsippany, NJ: Dale
Seymour Publications.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some
questions. American Psychologist, 18, 519-521.
Glaser, R. (1992). Expert knowledge and processes of thinking. In D. F. Halpern (Ed.),
Enhancing thinking skills in the sciences and mathematics (pp. 63-75). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Goldman, S.R., Mertz, D., & Pellegrino, J.W. (1988). Extended practice of basic addition facts:
Strategy changes in learning disabled students. Cognition & Instruction, 5(3), 223-265.
Graesser, A.C., Millis, K.K. & Zwaan, R.A. (1997). Discourse comprehension. Annual Review of
Psychology, 48, 163-189.
Graves, M.F. & Slater, W.H. (1987). Development of reading vocabularies in rural
disadvantaged students, intercity disadvantaged students and middle class suburban
students. Paper presented at the conference of the American Educational Research
Association, Washington, DC, April.
Greeno, J.G. (1978). A study of problem solving. In R. Glaser (Ed.) Advances in instructional
psychology (Vol. 1) (pp.13-75). Hillsdale, NJ: Lawrence Erlbaum Associates.
Greeno, J. G., Pearson, P. D., & Schoenfeld, A. H. (1996). Implications for NAEP of research
on learning and cognition. Report of a study commissioned by the National Academy of
Education. Panel on the NAEP Trial State Assessment, Conducted by the Institute for
Research on Learning. Stanford, CA: National Academy of Education.
Griffin, S., & Case, R., (1997). Re-thinking the primary school math curriculum: An approach
based on cognitive science. Issues in Education, 3(1), 1-49.
Griffin, S. A., Case, R., & Siegler, R. S. (1994). Rightstart: Providing the central conceptual
prerequisites for first formal learning of arithmetic to students at risk for school failure. In
K. McGilly (Ed.), Classroom lessons: Integrating cognitive theory and classroom
practice (pp. 1-50). Cambridge, MA: MIT Press/Bradford Books.
Groen, G.J., & Parkman, J.M. (1972). A chronometric analysis of simple addition.
Psychological Review, 79, 329-343.
Hamann, M.S., & Ashcraft, M.H. (1985). Simple and complex mental addition across
development. Journal of Experimental Child Psychology, 40, 49-72.
Hamilton, L. S., Nussbaum, E. M., & Snow, R. E. (1997). Interview procedures for validating
science assessments. Applied Measurement in Education, 10(2), 181-200.
Hart, B., & Risley, T.R. (1995). Meaningful differences in the everyday experience of young
American children. Baltimore: Paul H. Brookes Publishing Co.
Considerations for an AA-MAS
Page 145
Hatano, G. (1990). The nature of everyday science: A brief introduction. British Journal of
Developmental Psychology, 8, 245-250.
Hickey, D., & Pellegrino, J.W. (2005). Theory, level, and function: Three dimensions for
understanding transfer and student assessment. In J.P. Mestre (Ed.). Transfer of
learning from a modern multidisciplinary perspective (pp. 251-293). Greenwich, CO:
Information Age Publishing.
Houlihan, D.M., & Ginsburg, H.G. (1981). The addition methods of first- and second-grade
children. Journal for Research in Mathematics Education, 12, 95-106.
Huttenlocher, J. (1998). Language input and language growth. Preventive Medicine, 27, 195199.
Kaiser, M.K., Proffitt, D.R., and McCloskey, M. (1985). The development of beliefs about falling
objects. Perception & Psychophysics, 38(6), 533-539
Kalchman, M., Moss, J., & Case, R. (2001). Psychological models for development of
mathematical understanding: Rational numbers and functions. In S. Carver and D. Klahr,
(Eds.), Cognition and Instruction: Twenty-Five Years of Progress (pp. 1-38). Mahwah,
NJ: Erlbaum.
Kelly, A.E., Lesh, R.A., & Baek, J.Y. (Eds.) (2008). Handbook of design research methods in
education: Innovations in science, technology, engineering, and mathematics learning
and teaching. New York: Routledge.
Kintsch, W. (1998). Paradigms of comprehension. Oxford, England: Oxford University Press.
Lampert, M. (1986). Knowing, doing, and teaching multiplication. Cognition and Instruction, 3,
305-342.
Lane, L, Wang, N, & Magone, M. (1996). Gender-related differential item functioning on a
middle-school mathematics performance assessment. Educational Measurement:
Issues and Practice, 15(4), 21-28.
Lave, J. (1988). Cognition in practice. Cambridge, England: Cambridge University Press.
Lemke, J. J. (2000). Across the scale of time: Artifacts, activities, and meaning in ecosocial
systems. Mind, Culture, and Activity, 7 (4), 273-290.
Lesgold, A., & Perfetti, C.A. (Eds.). (1981). Interactive process in reading. Hillsdale, NJ:
Lawrence Erlbaum Associates.
Lesh, R., & Landau, M. (Eds.). (1983). Acquisition of mathematics concepts and processes.
New York: Academic Press.
Magone, M. E., Cai, J., Silver, E. A., & Wang, N. (1994). Validating the cognitive complexity
and content quality of a mathematics performance assessment. International Journal of
Educational Research, 21(3), 317-340.
Mannes, S.M., & Kintsch, W. (1987). Knowledge organization and text organization. Cognition
Considerations for an AA-MAS
Page 146
and Instruction, 4, 91-115.
Massey, C.M., & Gelman, R. (1988). Preschoolers decide whether pictured unfamiliar objects
can move themselves. Developmental Psychology, 24, 307-317.
Messick, S. (1993). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13103). Phoenix, AZ: Oryx Press.
Miller, S. P. (1997). Educational aspects of mathematics disabilities. Journal of Learning
Disabilities, 30(1), 47-56.
Miyake, A., Just, M.A., & Carpenter, P.A. (1994). Working memory constraints on the resolution
of lexical ambiguity: Maintaining multiple interpretations in neutral contexts. Journal of
Memory and Language, 33(2), 175-202.
Morales, R., Shute, V., & Pellegrino, J.W. (1985). Developmental differences in understanding
and solving simple mathematics word problems. Cognition and Instruction, 2, 41-57.
National Council of Teachers of Mathematics (1995). Assessment standards for school
mathematics. Reston, VA: Author.
National Council of Teachers of Mathematics (2000). Principles and standards for school
mathematics. Reston, VA: Author.
National Institute of Child Health and Human Development (2000). Teaching children to read:
An evidence-based assessment of the scientific research literature on reading and Its
implications for reading instruction. Report of the National Reading Panel. NIH
Publication No. 00-4769. Washington, DC: U.S. Department of Education.
National Research Council (1997). Educating one and all: Students with disabilities and
standards-based reform. Committee on Goals 2000 and the Inclusion of Students with
Disabilities. L. M. McDonnel, M. J. McLaughlin, and P. Morison, Eds. Washington, DC:
National Academy Press.
National Research Council (1998). Preventing reading difficulties in young children. Committee
on the Prevention of Reading Difficulties in Young Children. C.E. Snow, M. Burns, and
M. Griffin, Eds. Washington, DC: National Academy Press.
National Research Council (1999a). Grading the nation's report card: Evaluating NAEP and
transforming the assessment of educational progress. Committee on the Evaluation of
National and State Assessments of Educational Progress. J. W. Pellegrino, L. R. Jones,
and K. J. Mitchell, Eds. Washington, DC: National Academy Press.
National Research Council (1999b). How people learn: Brain, mind, experience, and school.
Committee on Developments in the Science of Learning. J. D. Bransford, A. L. Brown,
and R. R. Cocking, Eds. Washington, DC: National Academy Press.
National Research Council (1999c). How people learn: Bridging research and practice.
Committee on Learning Research and Educational Practice. M. S. Donovan, J. D.
Bransford, and J. W. Pellegrino, Eds. Washington, DC: National Academy Press.
Considerations for an AA-MAS
Page 147
National Research Council (1999d). High stakes: Testing for tracking, promotion, and
graduation. Committee on Appropriate Test Use. J. P. Heubert and R. M. Hauser, Eds.
Washington, DC: National Academy Press.
National Research Council (2000). How people learn: Brain, mind, experience, and school
(Expanded Edition). Committee on Developments in the Science of Learning and
Committee on Learning Research and Educational Practice. J. D. Bransford, A. L.
Brown, R. R. Cocking, M. S. Donovan, & J. W. Pellegrino, Eds. Washington, DC:
National Academy Press.
National Research Council (2001a). Knowing what students know: The science and design of
educational assessment. Committee on the Foundations of Assessment. J. Pellegrino,
N. Chudowsky, and R. Glaser, Eds. Washington, DC: National Academy Press.
National Research Council (2001b). Adding It Up: Helping children learn mathematics.
Mathematics Learning Study Committee. J. Kilpatrick, J. Swafford, and B. Findell, Eds.
Washington, DC: National Academy Press.
National Research Council (2003). Assessment in support of learning and instruction: Bridging
the gap between large-scale and classroom assessment. Committee on Assessment in
Support of Learning and Instruction. Washington, DC: National Academies Press.
National Research Council (2005). Systems for state science assessments. Committee on
Testing Design for K-12 Science Achievement. M.R. Wilson, & M.W. Bertenthal, Eds.
Washington DC: The National Academy Press.
National Research Council (2007). Taking science to school: Learning and teaching science in
grade K-8. Committee on Science Learning, Kindergarten through eighth grade. R.A.
Duschl, H.A., Schweingruber, & A.W. Shouse, Eds. Washington DC: The National
Academy Press.
Newcombe, N.S., & Huttenlocher, J. E. (2000). Making space. Cambridge, MA: MIT Press.
Norman, D.A. (1981). Categorization of action slips. Psychological Review, 88, 1-15.
O‘Connor, R.E., & Jenkins, J.R. (1999). The prediction of reading disabilities in kindergarten and
first grade. Scientific Studies of Reading, 3,159-197.
Olson, D.R. (1977). From utterance to text: The bias of language in speech and writing. Harvard
Educational Review, 47, 257-281.
Pearson, P.D., & Hamm, D.N. (2002). The assessment of reading comprehension: A review of
practice—past, present, and future. Santa Monica, CA: RAND Corporation.
Pellegrino, J.W. (1988). Mental models and mental tests. In H. Wainer & H.I. Braun (Eds.),
Test validity (pp. 49-60). Hillsdale, NJ: Lawrence Erlbaum Associates.
Pellegrino, J. W., Baxter, G. P., and Glaser, R. (1999). Addressing the "two disciplines"
problem: Linking theories of cognition and learning with assessment and instructional
practice. In A. Iran-Nejad and P. D. Pearson (Eds.), Review of research in education
(vol. 24) (pp. 307-353). Washington, DC: American Educational Research Association.
Considerations for an AA-MAS
Page 148
Pellegrino, J.W. & Hickey, D. (2006). Educational assessment: Towards better alignment
between theory and practice. In L. Verschaffel, F. Dochy, M. Boekaerts, & S. Vosniadou
(Eds.). Instructional psychology: Past, present and future trends. Sixteen essays in
honour of Erik De Corte (Advances in Learning and Instruction Series) (pp 169-189).
Oxford: Elsevier.
Perfetti, C.A. 1992 The representation problem in reading acquisition. In P.B. Gough, L.C. Ehri,
and R. Treiman (Eds.), Reading Acquisition (pp. 145-174). Hillsdale, NJ: Lawrence
Erlbaum.
Pressley, M., Johnson, C.J., Symons, S., McGoldrick, J.A., & Kurita, J.A. (1989). Strategies that
improve children‘s memory and comprehension of text. Elementary School Journal,
90(1), 3-32.
RAND (2002a). Toward an R and D program in reading comprehension. RAND Reading Study
Group, Catherine Snow, Chair. Santa Monica, CA: RAND.
RAND (2002b). Mathematical proficiency for all students: Toward a strategic research and
development program in mathematics education. RAND Mathematics Study Panel,
Deborah Loewenberg Ball, Chair. DRU-2773-OERI. Santa Monica, CA: RAND.
Resnick, L.B. (1982). Syntax and semantics in learning to subtract. Report # LRDC-1982/8,
Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA.
Resnick, L.B. (1984). Beyond error analysis: The role of understanding in elementary school
arithmetic. In H.N. Creek (Ed.), Diagnostic and prescriptive mathematics: Issues, ideas,
and insight (pp. 181-205). Kent, OH: Research Council for Diagnostic and Prescriptive
Mathematics.
Riley, M., Greeno, J., & Heller, J. (1983). Development of children‘s problem-solving ability in
arithmetic. In H. Ginsburg (Ed.), The development of mathematical thinking (pp. 153–
196). New York: Academic Press.
Rogoff, B. (1990). Apprenticeship in thinking: Cognitive development in social context. New
York: Oxford University Press.
Rosenbloom, P., & Newell, A. (1987). Learning by chunking: A production system model of
practice. In D. Klahr & P. Langley (Eds.), Production system models of learning and
development (pp. 221-286). Cambridge, MA: MIT Press.
Ruiz-Primo, M.A., Shavelson, R.J., Hamilton, L., & Klein, S. (2002). On the evaluation of
systemic science education reform: Searching for instructional sensitivity. Journal of
Research in Science Teaching, 39, 369-393.
Russell, R.L., & Ginsburg, H.P. (1984). Cognitive analysis of children's mathematics difficulties.
Cognition and Instruction, 1, 217-244.
Sadler, R. (1989). Formative assessment and the design of instructional systems. Instructional
Science, 18, 119-144.
Schoenfeld, A.H. (1985). Mathematical problem solving. Orlando, FL: Academic Press.
Considerations for an AA-MAS
Page 149
Shepard, L. A. (2000). The role of assessment in a learning culture. Educational Researcher,
29 (7), 4-14.
Siegler, R. S. (1998). Children's thinking (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
Siegler, R.S., & Shrager, J. (1984). Strategy choices in addition and subtraction: How do
children know what to do? In C. Sophian (Ed.), Origins of cognitive skills (pp. 229-293).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Simon, H.A. (1980). Problem solving and education. In D.T. Tuma and F. Reif (Eds.), Problem
solving and education: Issues in teaching and research (pp. 81-96). Hillsdale, NJ:
Erlbaum.
Smith, C, Wiser, M., Anderson, C.W, & Krajcik, J. (2006). Implications of children‘s learning for
assessment: A proposed learning progression for matter and the atomic molecular
theory. Measurement, 14(1&2), 1-98.
Starkey, P., & Cooper, R.G. (1980). Perception of numbers by human infants. Science, 210,
1033-1035.
Starkey, P., & Gelman, R. (1982). The development of addition and subtraction abilities prior to
formal schooling in arithmetic. In T.P. Carpenter, J.M. Moser, & T.A. Romberg (Eds.),
Addition and subtraction: A cognitive perspective (pp. 99-116). Hillsdale, NJ: Lawrence
Erlbaum Associates.
Steffe, L., Thompson, P., & Richards, J. (1982). Children's counting in arithmetical problem
solving. In T.P. Carpenter, J.M. Moser, & T. Romberg (Eds.). Addition and subtraction:
A cognitive perspective. Hillsdale, N.J.: Lawrence Erlbaum Associates.
Steffe, L.P., von Glaserfeld, E., Richards, J., & Cobb, P. (1983). Children's counting types:
Philosophy, theory, and application. New York: Praeger Scientific.
Stiggins, R. J. (1997). Student-centered classroom assessment. Upper Saddle River, NJ:
Prentice-Hall.
Svenson, O. (1975). Analysis of time required by children for simple additions. Acta
Psychologica, 39, 289-302.
Svenson, O., & Broquist, S. (1975). Strategies for solving simple addition problems: A
comparison of normal and subnormal children. Scandinavian Journal of Psychology, 16,
143-151.
Svenson, O., & Hedenborg, M.L. (1979). Strategies used by children when solving simple
subtractions. Acta Psychologica, 43, 477-489.
Swanson, H.L., & Jerman, O. (2006). Math disabilities: A selective meta-analysis of the
literature. Review of Educational Research, 76(2), 249-274.
Thorndike, E.L. (1931). Human learning. New York: Century.
Considerations for an AA-MAS
Page 150
Torgesen, J.K. (2002). The prevention of reading difficulties. Journal of School Psychology,
40(1), 7-26.
Unsworth, N. & Engle, R.W. (2007). The nature of individual differences in working memory
capacity: Active maintenance in primary memory and controlled search from secondary
memory. Psychological Review, 114(1), 104-132.
Van Lehn, K. (1983). Bugs are not enough: Empirical studies of bugs, impasses and repairs in
procedural skills. Journal of Mathematical Behavior, 3, 3-71.
Van Lehn, K. (1990). Mind bugs: the origins of procedural misconceptions. Cambridge, MA: MIT
Press.
Wagner, R.K., Torgesen, J.K., Laughon, N.P., Simmons, K., & Rashotte, C.A. (1993).
Development of young readers‘ phonological processing abilities. Journal of Educational
Psychology, 85, 83-103.
Wagner, R.K., Torgesen, J.K., Rashotte, C.A., Hecht, S.A., Barker, T.A., Burgess, S.R.,
Donahue, J., & Garon, T. (1997). Changing relations between phonological processing
abilities and word-level reading as children develop from beginning to skilled readers: A
5-year longitudinal study. Developmental Psychology, 33, 468-479.
Webb, N. L. (1997). Criteria for alignment of expectations and assessments in mathematics and
science education. National Institute for Science Education and Council of Chief State
School Officers Research Monograph No. 6. Washington, DC: Council of Chief State
School Officers.
Whitehurst, G.J., & Lonigan, C.J. (2001). Emergent literacy: Development from prereaders to
readers. In S.B. Neuman and D.K. Dickinson, eds.Handbook of Early Literacy Research
(pp. 11-29). New York: Guilford Press.
Wiggins, G. (1998) Educative assessment: Designing assessments to inform and improve
student performance. San Francisco: Jossey-Bass.
Wiliam, D. (2007). Keeping learning on track: formative assessment and the regulation of
learning. In F. K. Lester Jr. (Ed.), Second handbook of mathematics teaching and
learning (pp. 1053-1098). Greenwich, CT: Information Age Publishing.
Winkelman, H.J., & Schmidt, J. (1974). Associative confusions in mental arithmetic. Journal of
Experimental Psychology, 102, 734-736.
Woods, S.S., Resnick, L.B., & Groen, G.J. (1975). An experimental test of five models for
subtraction. Journal of Educational Psychology, 67, 17-21.
Young, R.M., & O'Shea, T. (1981). Errors in children's subtraction. Cognitive Science, 5, 153177.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history
assessment. Journal of Educational Measurement, 26(1), 55-66.
Considerations for an AA-MAS
Page 151
SECTION II
TEST DESIGN: UNDERSTANDING CONTENT AND ACHIEVEMENT
STANDARDS AND INCORPORATING APPROPRIATE ITEM
MODIFICATIONS
This section moves the discussion from one about the students—who they are, how they
learn, and how they should be instructed—to the assessment itself. Once we have a grasp on
which students might be best served by an alternate assessment based on modified
achievement standards, we need to determine how to take our understanding of the students
and apply it to good test design. Of critical importance is to understand how to cover the same
breadth and depth as a general assessment and yet make it less difficult. These modified
achievement standards can be less rigorous, but what does that truly mean?
Chapter 5, by David Pugalee and Bob Rickelman, bridges us from the discussions of
Section I to lay the foundation for good test design. It focuses on content standards and
curriculum and describes how content standards are developed. Then it moves to the key issue
of how to maintain the same content, only modifying the achievement standards. It ends with
some suggestions on ways to enhance or revise items to provide scaffolding for students who
may need additional supports in order to show what they know and can do. The authors point
out that the scaffolds described only work if they are incorporated both in instruction and
assessment.
Chapter 6, by Cathy Welch and Steve Dunbar, picks up where Chapter 5 leaves off,
focusing on types of modifications that can be made to the general assessment to make it more
appropriate for low-achieving students with disabilities. It also provides an overview of best item
and test development practices and uses these considerations to frame the discussion of areas
for modification. The authors then address the psychometric consequences of test modifications
as they play out in the assembly of test forms and in the analysis of technical characteristics of
items and test forms.
Considerations for an AA-MAS
Page 152
Then, in Chapter 7 by Marianne Perie, the focus turns to the modified achievement
standards. Here, the issue of rigor and what standard students are measured against is
addressed. This chapter focuses on the main components of achievement standards—numbers
and names of levels, achievement level descriptors, and cut scores—and provides guidance on
how to develop each component. The theory is brought back to match both the test design from
Chapter 6 and student cognition, discussed in Chapter 4.
This section also benefited from several helpful reviews from the expert panel members
who reviewed these chapters. In particular, comments from Howard Everson, Suzanne Lane,
Brian Gong, and Claudia Flowers were especially insightful and helped inform the final chapters.
Considerations for an AA-MAS
Page 153
CHAPTER 5
UNDERSTANDING THE CONTENT
David K. Pugalee
Robert J. Rickelman
In order to understand the assessment process, whether discussing general
assessments or alternate assessments, it is essential to have a good basic understanding of the
content learning that is being assessed. The content domain, as explicated in state standards,
must be the continued focus of assessment and the underlying force that drives instruction. For
alternate assessments based on modified achievement standards (AA-MAS), students‘ work
must align with the published state grade-level standards. But how are these standards
developed? How do they link to the curriculum approved for use in schools? How do these
standards come into play when developing IEP goals? In the previous chapter, Pellegrino
described how cognition plays a major role in student learning and assessment. In Chapter 3,
Karvonen discussed the IEP process in detail, and suggested that the ―Opportunity to learn
requires a curriculum that is well-aligned to state standards and assessment‖ (p. 51). This
chapter will further these arguments by focusing on how the content standards reflect these
content domains and provide a framework for both testing and instruction for students who meet
the AA-MAS criteria, as discussed in Quenemoen (Chapter 2, this volume).
This chapter defines curriculum, explains the link to content standards, and describes
how content standards are developed by states. It is important to understand the difference
between content standards and achievement standards, so this differentiation will be made.
Finally, issues surrounding links between the general content and modified assessments will be
discussed, including examples related to mathematics and English/Language Arts. A discussion
related to the effects of scaffolding on instruction and assessment will conclude this chapter.
Considerations for an AA-MAS
Page 154
What is the Curriculum?
At a very basic level, a curriculum is a set of planned instructional activities that are
designed to allow students to document achievement of their knowledge and skills, including
how these skills can be applied to real-life situations. The goal of a curriculum is to provide a
comprehensive focus for instruction and learning within a school, a school system, and/or
across a state. They also provide a scope and sequence of skills within and across grade levels.
The curriculum generally drives the important factor of materials that will be used to implement
the curriculum, and often there are several choices among state-approved materials developed
by different sources within the state or at the national level. These materials are showcased in
teacher‘s manuals and related materials, detailing the overarching philosophy and theory behind
the development of the curriculum, and how it can be used across grade levels. These
philosophy statements are generally written by an expert editorial team—often including
university faculty who are experts in the field and by school personnel or state-level curriculum
experts. In short, the curriculum is the glue that holds the pieces together, informing both
teaching and learning, which should then link to the content standards assessed and
subsequent interpretation of the assessment, which, in turn, should then drive instruction. In this
continuous improvement loop (instruction – assessment – interpretation – instruction . . .), the
curriculum lays out the scope and sequence of skills aligned to the content standards that will
be taught and assessed in different subjects across different grade levels.
For example, a state may approve five different programs to be used to inform the
mathematics curriculum across the K-12 grade levels. In order to purchase materials with state
funding, a school district would have to choose from among the state-approved materials. It is
common for states to have textbook adoption committees, made up of experts within the content
field, who make decisions about the quality of the options and how well they align to the state
curriculum goals and approved content standards.
Considerations for an AA-MAS
Page 155
A less common option is that a set of materials can be developed at a much smaller
scale, to be used with smaller populations of students. For instance, in a grade level where a
specific state history is the focus for a course, related materials will likely be developed by instate experts, since the content would not appeal to anyone in other states. It is common for
individuals to develop these materials around a curriculum that will guide their decision making
at a local level.
In alternate assessments based on modified achievement standards, students must
document that they can meet the state grade-level content standards. What this means is that
they must be given the opportunity to learn and be assessed using the same curriculum as the
general population of students who are not using modified assessments. In other words, they
cannot be accountable for learning a different set of standards or for using a curriculum that is
not available to the general population of students within that state. Not only must they have
access to the general curriculum, but they must take part in a modified assessment that is
aligned to grade-level content standards. So, a student in the ninth grade could not be assessed
on standards established for fifth grade, even though that might be the grade level most
representative of that student‘s observed skill.
How are Content Standards Developed?
Content standards are generally developed in one of two ways. In some subject areas,
standards are established at the national level, and these subsequently drive the development
of individual state standards. For instance, in mathematics, the National Council of Teachers of
Mathematics (NCTM, 2000) has developed a set of standards for grades pre-kindergarten
through 12th grade. According to the Executive Summary of the 2000 Principles and Standards
for School Mathematics (available at
http://www.nctm.org/uploadedFiles/Math_Standards/12752_exec_pssm.pdf), these standards
were developed by a set of national content experts, with broad opportunities for input from
Considerations for an AA-MAS
Page 156
teachers and others, based on an extensive study of curriculum materials, state documents, and
best practice research, to:
Set forth a comprehensive and coherent set of learning goals for PK-12 math
Serve as a resource for educators and policymakers
Guide the development of curriculum frameworks, assessments, and instructional
materials, and
Stimulate ideas and ongoing conversations at the national, state, and local levels about
how best to help students gain a deep understanding of important mathematics (p.1).
These standards are used extensively to guide state standards committees, which
shape the NCTM standards to the needs of the specific state, including aligning them to state
content and assessments required of all students. So, while there may be minor differences in
the details of the state standards across states, the general standards themselves are very
consistent across states.
The second way that content standards are developed is within states, when nationallevel standards have not been developed, or when national-level content standards are fairly
generic, and perhaps not specifically aligned to grade levels. These standards are considered to
be more generic guiding principles, and can be helpful in developing an overall philosophy of
the goals for standards; but more specific, focused, and assessable statements must be crafted
by state-level experts to make sense of the continuous improvement cycle mentioned earlier.
The Reading and English/Language Arts (ELA) content area provides an example of this
type of standards development. In 1996, the National Council of Teachers of English (NCTE)
and the International Reading Association (IRA) published Standards for the English Language
Arts. This document was the result of five years of collaboration between these two professional
organizations. However, unlike the NCTM standards which will be discussed in detail later in
this chapter and which are broken down into specific content and process standards across
Considerations for an AA-MAS
Page 157
grade levels, these ELA standards are more generic. For instance, Standard One states that a
goal for ELA instruction should be that,
Students read a wide range of print and nonprint texts to build an understanding of
texts, of themselves, and of the cultures of the United States and the world; to
acquire new information; to respond to the needs and demands of society and the
workplace; and for personal fulfillment. Among these texts are fiction and nonfiction,
classic and contemporary works (p.3).
It is up to each individual state to determine how this general recommendation is
actualized within and across grade levels within the state. So these can be considered more like
guiding principles for content standard development rather than actual grade-level content
standards.
It is easy to see that, unlike the NCTM standards, these are not grade-band specific
standards, but rather 12 general standards supported by research and classroom vignettes of
what might happen in a classroom in which the standards were being implemented. Since these
are more recommendations than standards, states generally must craft their own standards
using committees made up of experts in the field who work in state departments of public
instruction, colleges and universities, and public and private schools. These committees meet
for several days to write the PK–12 state standards in each subject area, using (much like
NCTM did on a national level) previous state standards and current state and national policy,
along with scientific research in best practices, to put together a detailed list of standards for
each grade level. These standards are then generally widely disseminated among stakeholders
for additional input before being officially approved and implemented by the state department.
States vary greatly in how often these committees meet to update standards, whether
they come from national or state sources. Many states update grade-level content standards
every five years, but this cycle is often interrupted by national or state mandates and/or new
federal laws, so that standards may be changed more than once every five years to adhere to
Considerations for an AA-MAS
Page 158
new mandates, or even less than every five years if major mandates are expected to be
forthcoming and states want more guidance before proceeding with this tedious task.
Several important points must be made concerning the development of content
standards, especially in subject areas for which national standards have not been developed.
First, the published content standards, critically important to the assessment process, are
assumed to be the ―gold standard‖ within states—reference points of knowledge to which
students must achieve. But often there is no clear, scientific methodology behind the
development of these standards, as mentioned earlier. In fact, when one state was recently
working on its modified achievement standards and developing learning maps for each content
standard in each grade level, the teams working on developing the maps struggled to
understand what some of the state content standards actually meant, and how they could be
taught and assessed and interpreted across the general, modified, and alternate assessment
systems. After much struggle, frustration, and consultation, the teams decided that some of the
state standards were just poorly written, and some team members expressed a strong interest
in being named to content standard writing committees in the future. Another point is that when
these content standards are being developed, there is often no deliberate thought about how
each content standard will be assessed, which was one problem that these teams discovered
In other words, the development of this ―gold standard‖ is sometimes quite unscientific
and can be heavily influenced by one or two strong committee members who may or may not
espouse a certain ideology of what should be taught and learned within the state. So, while the
process of developing content standards obviously has to take place, understanding both how
these standards are developed and how they may be unduly influenced by individual committee
members is important to keep in mind. Sending out drafts of standards to a broad variety of
stakeholders can be helpful in terms of quality control, but it remains, by nature, an imperfect
system.
Considerations for an AA-MAS
Page 159
As mentioned in the previous chapter (Pellegrino, chapter 4, this volume), learning
progressions can provide guidance about how a typical skill will be developed on a theoretical
level, but these learning progressions have limited usefulness, since some students (perhaps
many) do not follow typical learning patterns. Some states use the term ―learning maps‖, which
should not be confused with learning progressions, to offer suggested pathways in which a
student can learn and be assessed across achievement standards, with the understanding that,
just like a road map, these pathways are meant to offer guidance, but can (and often must)
allow for deviation to account for individual differences. As mentioned by Pellegrino (chapter 4,
this volume) this flexibility is especially important for students being assessed against alternate
and modified achievement standards, since they may be more atypical than their peers in
adhering to both theoretical and practical expectations for learning and developing expertise for
both declarative and procedural knowledge.
In other words, there is little solid evidence that there is only one way in which all
students acquire knowledge. There are, more likely, multiple pathways, and learning maps offer
practical guidance into how these might develop, especially in terms of depth of knowledge of
the standards. These would generally be developed for each grade-level standard, and
appropriate assessment measures would need to be developed to allow students the
opportunity to show performance across the levels. These maps also link to achievement
standards, often by taking into account depth of knowledge, with the assumption that more
depth can demonstrate a higher achievement level. The effectiveness of such maps in positively
impacting student learning is dependent on a rich system of formative assessment processes,
aligned to instruction that provides pictures of what students are able to do on multiple tasks
related to the standard. Examples of learning or progress maps in ELA and mathematics can be
seen in the following:
Considerations for an AA-MAS
Page 160
Published State Standard: Student will be able to understand simile and metaphor.
Developing
Student will be able
to correctly label
figurative language
and literal language
given lists of
statements.
Proficient
Student will be able to
correctly identify a simile
and a metaphor embedded
in a paragraph of text.
Target
Student will be able
to use an
appropriate simile
or a metaphor in
their own written
work.
Advanced
Student will be able
to use a simile or a
metaphor in their
own written work and
will be able to
discuss the relevant
characteristics of the
word(s) being
compared.
Published State Standard: Patterns and Functions: Demonstrate and explain the difference
between repeating patterns and growing patterns. (See Ban, Holt, & Kurizaki, 2008).
Less Complex
More Complex
Proficient
Student can
describe a growing
pattern by using
objects, pictures
and numbers.
Student can use
appropriate
vocabulary to
describe the
growing pattern.
Student can describe
repeating AND growing
patterns by paying attention
to how each element in the
pattern relates to each
other.
Student can use
appropriate vocabulary to
explain/justify the growing
pattern.
Student can
describe repeating
AND growing
patterns. Student
can use appropriate
vocabulary to
explain/justify the
growing pattern.
Student can use
comparison/contrast
and cause-effect
language to
describe similarity
and differences
among patterns.
Difference between Content Standards and Achievement Standards
The Modified Academic Achievement Standards document (U.S. Department of
Education, 2007) defines academic content standards as ―statements of the knowledge and
skills that schools are expected to teach and students are expected to learn (p. 12-13).‖ These
content standards are mandated for all students, regardless of ability, and are meant to drive
instruction and assessments. These are the content standards discussed earlier, established at
the national or state level by teams of experts and stakeholders. On the other hand, academic
Considerations for an AA-MAS
Page 161
achievement standards ―are explicit definitions of how students are expected to demonstrate
attainment of the knowledge and skills of the content standards (p. 13).‖ They further state that
academic achievement standards must have the following elements:
At least three achievement levels, which are labels that convey the degree of
achievement in a given subject area (e.g., proficient, developing, not proficient, etc.)
Achievement descriptors, which are descriptions of content based competencies
associated with each of the achievement levels established (what students at each level
know and can demonstrate), and
Cut scores, which separate one level of achievement from another (how is a proficient
student different from a developing student, etc.)
More will be said about establishing achievement standards in Chapter 7 (Perie, this
volume). This differentiation is important, because within an AA-MAS system, ONLY the
academic achievement standards may be modified, NOT the content standards. In order for
assessments to provide meaningful information about students‘ academic progress and
promote accountability, there must be a clear alignment between the assessments and
academic content standards. This process has been discussed in much more depth by
Pelligrino (Chapter 4, this volume) with the discussion of the Assessment Triangle.
Barriers in Providing Access to the General Curriculum
Ideally, all students must have access to the general curriculum used in their school, and
be able to be assessed with the same assessment that all students use, in such a way as to
document learning through performance within a general assessment system. But, with the
broad diversity of skills and language that are typical in most schools in every state, this ideal is
very difficult to fully implement, which is why states are allowed alternate assessments for a
limited population of students. In the not so distant past, students with significant disabilities
were not allowed to attend school. When the IDEA was originally passed by Congress in 1975,
Considerations for an AA-MAS
Page 162
school-based placements were mandated, but these were generally done in segregated settings
within the school. More recently, students with disabilities (including those with significant and
mild disabilities) were able to be excluded from the general assessments used for other
students in the school who did not have documented disabilities. However, in the most recent
era of high-stakes assessments and No Child Left Behind legislation, school administrators can
no longer have free reign to exclude certain students from being counted in the standardized
assessment system. All students have to be included on Adequate Yearly Progress (AYP)
reporting to the state and federal governments.
These changes are helpful in that all students count in terms of AYP. Schools are no
longer able to exclude a student because of severe disabilities, and they must document that
the content being taught and learned in the school setting has direct or indirect links to the
general curriculum being studied by peers. In the past, some administrators thought of some
students in terms of a ―test score,‖ and tried to exclude any student thought to be detrimental in
bringing down the schools overall achievement level, which may impact their ability to meet their
AYP goals. Now schools must find ways to create meaningful links to the curriculum for
students with even the most significant disabilities. But this requirement also created
challenges, or barriers, that were brought to the fore after the new federal legislation was
implemented.
One barrier is the way that many teachers are trained, both at the in-service and preservice levels. In a nutshell, special education teachers (especially those working with the most
severe students) are not trained (with rare exceptions) to think about the general state
standards and required curriculum and are not shown how to link their teaching and student
learning directly to those standards, since students with disabilities were traditionally expected
to work on nonacademic skills, which are important in terms of day-to-day living. Many students,
including those with more mild intellectual disabilities, were generally functioning below grade
level. So much of the focus of instruction was spent on teaching and learning content standards
Considerations for an AA-MAS
Page 163
at lower grade levels, in addition to nonacademic content, such as life skills. On the other hand,
general education teachers and teacher candidates, who were generally much better versed in
understanding how their teaching was driven by the state standards expected within the
different content subjects, had little or no idea how to apply that information to students with
disabilities. In fact, it was common in most schools to find that general education and special
education teachers rarely, if ever, interacted professionally within a school setting. This is not
surprising, considering that these teachers are generally trained in segregated classes within
teacher training programs. Even in graduate schools, special education teachers rarely take
classes with general education peers. So the information necessary to successfully navigate the
new federal mandates were generally not shared among the two sets of teachers.
Why is this important? General education and special education professionals must find
ways to align their points of view, with general education teachers providing help in
understanding mandated grade-level standards for learning and special education teachers
bringing expertise in how to teach the necessary skills to achieve these standards to students
with disabilities. There is a synergy when special and general education teachers work together
that is not possible when they work in isolation. General education teachers are able to address
the ―What‖ questions surrounding student learning—what is meant by a state standard? What
are higher level thinking skills? Process writing skills? What does it look like in a classroom
setting? And special education teachers are able to better address the ―How‖ questions—how
do I teach a nonverbal student to understand algebraic principles? How can I teach a blind or
deaf student to read? Both kinds of expertise are needed in order to ensure that all students
have access to good teaching and learning. But still, in most teacher training schools and many
PK-12 schools, these teachers continue to work alone, with rare exceptions.
So, how do we create professional development models that break these barriers? What
processes need to be in place for this to happen? First, it is essential that teacher training
institutions consider developing teacher candidates with dual expertise in both general
Considerations for an AA-MAS
Page 164
education and special education. Regardless of what subject each individual teacher is
expected to teach, this dual expertise will be helpful, especially in a diverse society, where a
―typical‖ classroom includes students with intellectual disabilities, learning differences, language
differences, etc. Imagine programs where elementary and middle/secondary teacher candidates
work alongside special education teacher candidate colleagues. Imagine graduate programs
where candidates for general and special education degrees collaborate with colleagues in
educational administration and school counseling preparation programs, working through
common scenarios and learning how to cooperate in the world in which they will eventually be
expected to collaborate. For schools, this could open a broad door to professional development
opportunities, using case studies and real-life scenarios to promote school climates where all
teachers understand both the ―what‖ and the ―how,‖ regardless of their assigned grade level or
content expertise, and they are supervised and supported by professionals with similar training
and expertise.
In addition, these partnerships between general and special educators must be fostered
to allow for the design of meaningful assessments, including the tricky work of designing an
assessment that is less difficult but that maintains the same levels of breadth and depth.
General education teaching specialists should have the content knowledge related to the
domain being assessed, to be able to ensure that depth and breadth are maintained, and
special education teaching experts generally can address alternate (and perhaps less difficult)
ways of allowing students to ―show what they know.‖ In addition, these teams can help develop
learning maps and/or curricula that utilize best practices in teaching, thereby allowing for
alternate ways to teach students skills and processes related to content standards.
By working together and negotiating this fine line between content integrity and less
difficult assessments, the synergy mentioned earlier can be maintained. And these experts must
work with assessment experts, as well, with the general education teachers making judgments
about fidelity of assessment items to the learning domain, and special education experts making
Considerations for an AA-MAS
Page 165
judgments about accessibility to testing for students who may need supports in order to show
what they know.
The Content Standards for Mathematics
The majority of states have mathematics content standards that align to Principles and
Standards for School Mathematics, the content standards document published by the National
Council of Teachers of Mathematics (2000). This document presents key mathematics goals for
students in pre-K through twelfth grade. The document describes ten standards, five content
and five process standards, that represent a comprehensive and connected organization of key
mathematical understandings and competencies of what students should know and be able to
do. The content standards include number and operations, algebra, geometry, measurement,
and data analysis and probability.
The five process standards underscoring ways of acquiring and using mathematics
content knowledge include problem solving, reasoning and proof, communication, connections,
and representation. For the five content standards, broad goals are presented for all students
preK–12 with specific expectations explicated for the various grade bands: pre-K through grade
2, grades 3 through 5, grades 6 through 8, and grades 9 through 12. The following table
presents the goals for each of the content standards for all students grades pre-K through 12.
Table 5-1. Goals for Pre-kindergarten through Grade 12 for Five Content Standards
Number and
Operations
Understand numbers,
ways of representing
numbers,
relationships among
numbers, and
number systems;
Understand
meanings of
operations and how
they relate to one
another;
Compute fluently and
make reasonable
Algebra
Understand patterns,
relations, and
functions;
Represent and
analyze
mathematical
situations and
structures using
algebraic symbols;
Use mathematical
models to represent
and understand
quantitative
relationships;
Considerations for an AA-MAS
Geometry
Analyze
characteristics and
properties of twoand threedimensional
geometric shapes
and develop
mathematical
arguments about
geometric
relationships;
Specify locations
and describe spatial
relationships using
coordinate
Measurement
Understand
measurable
attributes of objects
and the units,
systems, and
processes of
measurement;
Apply appropriate
techniques, tools,
and formulas to
determine
measurements.
Data Analysis and
Probability
Formulate questions
that can be
addressed with data
and collect, organize,
and display relevant
data to answer them;
Select and use
appropriate statistical
methods to analyze
data;
Develop and
evaluate inferences
and predictions that
are based on data;
Page 166
estimates
Analyze change in
various contexts.
geometry and other
representational
systems;
Understand and
apply basic concepts
of probability
Apply
transformations and
use symmetry to
analyze
mathematical
situations;
Use visualization,
spatial reasoning,
and geometric
modeling to solve
problems.
Each of the five content standards is broken down by grade-level bands providing
greater specificity as to what is expected of students at that particular grade level. This
elaboration is provided for each of the goals listed in the above table. For example, the algebra
standard includes ―analyze change in various contexts‖ as one of the goals. This following
presents the expectations at various grade bands for this goal.
Table 5-2. Grade Band Expectations for the Algebra Standard
Pre-K through Grade 2
Describe qualitative
change, such as a
student's growing taller;
Describe quantitative
change, such as a
student's growing two
inches in one year.
Grade 3 through Grade
5
Investigate how a
change in one variable
relates to a change in a
second variable;
Grade 6 through Grade
8
Use graphs to analyze
the nature of changes in
quantities in linear
relationships.
Grade 9 through Grade
12
Approximate and
interpret rates of change
from graphical and
numerical data.
Identify and describe
situations with constant
or varying rates of
change and compare
them.
Similarly, the process standards for mathematics provide general guidelines that assist in
describing the types of mental processes that are inherent in a well-balanced mathematics
curriculum. Though more difficult to specify in terms of concrete and measureable behaviors,
the process standards present key ideas about what is valued in the discipline. The following
table lists the goals for each of the five process standards for pre-kindergarten through grade
12. The Principles and Standards for School Mathematics (NCTM, 2000) does not further
Considerations for an AA-MAS
Page 167
delineate grade band expectations for the goals. Later in this section, the process standards will
be revised in reference to their use in designing state assessments.
Table 5-3. Process Standards for Pre-K through Grade 12
Problem Solving
Build new
mathematical
knowledge through
problem solving;
Solve problems that
arise in mathematics
and in other
contexts;
Apply and adapt a
variety of
appropriate
strategies to solve
problems;
Monitor and reflect
on the process of
mathematical
problem solving.
Reasoning and
Proof
Recognize
reasoning and proof
as fundamental
aspects of
mathematics;
Make and
investigate
mathematical
conjectures;
Develop and
evaluate
mathematical
arguments and
proofs;
Select and use
various types of
reasoning and
methods of proof.
Communication
Organize and
consolidate their
mathematical
thinking through
communication;
Communicate their
mathematical
thinking coherently
and clearly to peers,
teachers, and others;
Analyze and
evaluate the
mathematical
thinking and
strategies of others;
Connections
Recognize and use
connections among
mathematical ideas;
Understand how
mathematical ideas
interconnect and
build on one another
to produce a
coherent whole;
Recognize and
apply mathematics
in contexts outside
of mathematics.
Representation
Create and use
representations to
organize, record, and
communicate
mathematical ideas;
Select, apply, and
translate among
mathematical
representations to solve
problems;
Use representations to
model and interpret
physical, social, and
mathematical
phenomena.
Use the language of
mathematics to
express
mathematical ideas
precisely.
Recognizing that states and local educational agencies were often challenged in
implementing rigorous assessment and accountability systems and to assist teachers in
identifying consistent priorities and focus, the NCTM (2006) developed Curriculum Focal Points
for Prekindergarten through Grade 8 Mathematics. In this document, NCTM asserts that:
…organizing a curriculum around these described focal points, with a clear emphasis
on the processes that Principles and Standards addresses in the Process
Standards—communication, reasoning, representation, connections, and,
particularly, problem solving—can provide students with a connected, coherent, ever
expanding body of mathematical knowledge and ways of thinking. (p.1)
Considerations for an AA-MAS
Page 168
It is clear that these documents offer a comprehensive picture of the domain of school
mathematics. It is further evident that such documents provide the core for developing
curriculum and informing instructional and assessment priorities.
Professional specialty organizations, such as the National Council of Teachers of
Mathematics and various government sponsored enterprises including the National
Mathematics Advisory Panel, provide an articulation of the content domain for mathematics.
This articulation is a broad framework providing state educational agencies with a launching
point from which to develop grade-level specific mathematics competencies. States use
different processes to develop academic standards for grade-level content. The resulting
documents become the critical focus as states develop assessments, including modified
achievement standards, to assess students‘ proficiency towards meeting those state
competencies.
Sampling Mathematics
Important mathematics that should be reflected in assessments includes ―both the
necessary content and the interconnectedness of topics and process‖ (Mathematical Sciences
Education Board [MSEB], 1993, p. 42). The National Assessment of Educational Progress
[NAEP] employs a new way to characterize the learning domain and the corresponding
assessment that utilizes a lattice structure allowing a more interconnected view of mathematics.
Since 1995, items reflect five content categories: number and operations; measurement;
geometry; data analysis, probability, and statistics; and algebra and functions. Also included are
mathematical abilities categories: conceptual understanding, procedural knowledge, and
problem solving. These ability categories are considered in the final stage of development to
confirm that there is balance among the three categories though not necessarily within each
content category (MSEB, 1993).
Considerations for an AA-MAS
Page 169
New York, for example, has test blueprints for mathematics that assess a range of
mathematics skills and abilities. The items are also aligned with one content-performance
indicator for reporting purposes and are also aligned to one or more process-performance
indicators (New York State Education Department, 2007a). The alignment to both content and
process strands is intended to provide tests which ―assess students‘ conceptual understanding,
procedural fluency, and problem-solving abilities, rather than solely addressing their knowledge
of isolated skills and facts‖ (p. 5). New York includes five content strands: number sense and
operations, algebra, geometry, measurement, and statistics and probability. The distribution of
score points across the strands was determined during specifications meetings with panels of
New York State educators during blueprint specifications meetings. The 2007 Blueprint, for
example, indicates that at grade 5 for the content strand of algebra that the target number of
points would be 5 (6 points were selected for the test) comprising 11% of the test (13% was the
percentage of items selected for the test).
The Content Standards for English/Language Arts
As mentioned at the beginning of this chapter, there are no specific grade-level content
standards in ELA at the national level, as there are in mathematics. So the manner of teaching,
learning, and assessing ELA will, not surprisingly, also be different. Part of the issue was
discussed earlier in Chapter 4 (Pellegrino). While it is fairly easy to observe and draw inferences
about how a child might develop skills in and learn long division, it is much more difficult to
make inferences about how a child might learn to comprehend information and, as also
mentioned earlier, even these inferences from commonly used measures can be misleading.
This is likely one reason why there is less consensus among experts about how ELA develops,
than there is in mathematics. Rather, there are the general recommendations from the
NCTE/IRA publication mentioned earlier, which are somewhat outdated but still sound.
Considerations for an AA-MAS
Page 170
It might be easy to assume that the ELA standards across different states are quite
different, especially compared to areas like math where there are national guidelines for
standard setting. But an examination of ELA standards by grade level and across different
states reveals that they are actually comparable. Part of the reason for this general consensus
relates to the extensive research that is available in ELA, outlined in more detail in Chapter 4.
So there does tend to be much general consensus, framed by this research, about the
standards that need to be taught in different grades, with an initial focus on ―learning to read‖
being gradually replaced by an emphasis on ―reading to learn.‖
This distinction is important, and first surfaced in the early 1970s (Herber, 1970). In the
first few years of school, especially at the preschool through second grade levels, much ELA
instruction is focused on ―learning to read,‖ which involves learning the requisite skills that lead
to more complex reading skills, in phonics, phonemic awareness, etc., and also developing
fluency practicing and using these skills on both narrative and informational texts. As students
begin to move into third grade, in general, and even more dramatically as they move into the
middle grades and then into high school, the emphasis shifts further and further away from
these core basics, since the assumption is that they have been taught and learned in the earlier
grades. The emphasis then shifts to ―reading to learn.‖ This means that students become more
and more accountable for using the earlier developed skills to read, in order to learn content
information—mathematics, science, social studies, etc. While there is not a clean cut break
between the two, and there is indeed much overlap as students work to learn to read and read
to learn at the same time, as students progress to the higher grades, the skills expected to be
learned become much more intangible, making them more difficult to assess. While it is fairly
simple to teach and assess a student‘s ability to put letters and sounds together to make words,
it is much more difficult to teach and assess a student‘s ability to utilize higher order thinking
and critical comprehension skills. So the task of developing content standards, curricula and
assessments tends to be much more straightforward in the earlier grades, where the skills can
Considerations for an AA-MAS
Page 171
be demonstrated in a much more concrete way, and much more difficult as students move into
the higher grade levels, and documenting learning of the skills becomes much less of a
concrete process (as Pellegrino discussed in Chapter 4, this volume).
Much attention has been paid to the report of the National Reading Panel (2000)
(www.nationalreadingpanel.org). For instance, the findings of the panel were used in
establishing the Reading First program, a $5 billion dollar initiative introduced during the latest
Bush administration. Many curriculum materials and school professional development activities
are developed around the ―Big Five‖ or the ―Essential Five‖ skills highlighted in the report—
phonemic awareness, phonics, fluency, vocabulary, and comprehension. Some advocates of
the report suggest that these ―big five‖ are the only skills that are ―research based,‖ and so
these are the only ones that should be a part of the state ELA standards.
Timothy Shanahan, a member of the National Reading Panel, tried to dispel myths
surrounding the panel report (2003), discussing both what the report said and what it did not
say. He stated that, in determining which areas to study in the report, ― . . . we arrived at more
than 30 topics that we thought might merit review—and even that list was not complete with
regard to all topics that have been researched or that have been discussed as having potential
importance (p. 649–650).‖ Some of these topics were not studied because the panel felt that
they had been adequately reviewed elsewhere in the professional literature. Some were not
studied because the panel felt that there was not enough evidence (previously published
research) available to include it within the framework of the meta-analysis. Some specific
studies were not included in the report, even though the topic of the study WAS included in the
report, because the studies did not meet the criteria established by the panel for inclusion in the
meta-analysis, not because, as some experts suggest, the research was not scientific in nature.
In terms of how the work of the National Reading Panel influences state standards in
ELA, it is safe to say that the ―Big Five‖ should certainly be included, since there is strong
evidence that they are indeed essential for allowing students to access the curriculum in other
Considerations for an AA-MAS
Page 172
content subjects. However, it is not safe to assume that if a topic was not included in the report,
there is no scientific evidence that it is worth learning.
Aligning State Standards to Assessments
Using curriculum documents, such as content maps, developed at the state level,
assessments are designed to represent the content domain. Content match and depth match
are two dimensions on which to consider how the assessments align to the curriculum
(LaMarca, 2001). Content match has to do with the degree to which the assessment content
matches to the subject area content, as identified in the state academic standards. Content
match may be further delineated through analysis of broad content coverage, range of
coverage, and balance of coverage. Broad content coverage, or the categorical congruence of
the assessment, addresses whether the test content links to the broad content standards.
Range of coverage asks whether the test items address the specific objectives related to each
content standard. Balance of coverage is concerned with whether the assessment items reflect
the major emphases and priorities found within the academic standards. The second dimension
of alignment, depth match, is related to the degree to which the test items match the skills and
knowledge specified in the state‘s academic standards in terms of cognitive complexity. Once
items are developed, there should be a systematic analysis of the alignment that includes a
determination of what objective an item measures and the degree of cognitive complexity for
that item (LaMarca, 2001; Webb, 1999).
Guidance on Test Specifications
Ketterlin-Geller (2008) proposes a model of assessment development which extends the
assessment model created by the National Research Council (Pellegrino, Chudowsky, &
Glaser, 2001) in order to better meet the needs of students with cognitive disabilities. KetterlinGeller‘s model is motivated by the concept of universal design as applied to educational testing.
Students with cognitive disabilities may not interact with a test in the same way as general
Considerations for an AA-MAS
Page 173
population students. This causes ―construct-irrelevant variance‖ that prevents an accurate
assessment of ―domain-specific knowledge‖ (Ketterlin-Geller, 2008, p.4). Universally designed
tests assess the same constructs but have flexibility in the format or delivery of the test, thus
rendering them more useful and accessible to a greater percentage of the student population.
Ketterlin-Geller (2008) argues that ―applying the principles of universal design to academic
assessments provides a mechanism for reducing the impact of construct-irrelevant variance on
test-score interpretation, thereby, increasing the validity of the uses of test results‖ (p.4).
Assessment must consider the interaction between observation, cognition, and
interpretation in the assessment design (Pellegrino, Chudowsky, & Glaser, 2001). KetterlinGeller (2008) elaborates on some basic ideas on how this model informs the design of
assessment tools. In this model, within the assessment triangle detailed in Chapter 4, the
cognition aspect represents the theories and beliefs of learning within the domain. Cognitive
models should reflect the ways that children learn content within the targeted domain. Such
targets include broad constructs such as analytic reasoning as well as narrow components such
as the unit of length in mathematics. The observation aspect involves collecting student
behaviors, which become the basis for interpretations about the cognitive targets. The features
of assessments should reflect and align with the construct. Students with significant disabilities
may interpret and respond to items as a result of their disability contributing to constructirrelevant variance. Assessment features such as test platform, item format, problem context,
administration procedures, and scoring systems should be considered when determining the
characteristics of assessment tools. The interpretation aspect grounds decisions made about
student skills and knowledge in the domain. Student characteristics, the cognitive model, and
the observational tool interact in ways that influence the interpretation of student performance.
Failure to consider these interactions may lead to problems with the validity of score-based
interpretations.
Considerations for an AA-MAS
Page 174
Part of creating a universally designed test is to incorporate a cognitive task analysis for
each test item (alongside the content of the domain targeted). Ketterlin-Geller proposed a
cognitive task analysis along four levels of cognitive engagement: knowledge and application of
general facts and procedures, knowledge and application of concepts and procedures, strategic
thinking, and extended thinking. Furthermore, delineation between target and access skills
should be clear. Target skills include both the cognitive and content components that the test is
designed to actually measure. Access skills, on the other hand, include cognitive and content
components that are needed to attain the target skills, but which the test is not designed to
measure. Explicitly analyzing and articulating the cognitive tasks underlying a given problem
can lead to better test accommodations for students with cognitive disabilities. For instance, a
cognitive task analysis for a given mathematics problem may reveal that a test taker must be
familiar with the concept of a calendar. If familiarity with a calendar is classified as an access
skill, and the student has a limited concept of a calendar, an accommodation may be made (e.g.
eliminate calendar reference or include an explanation of concept of calendar in problem).
As the assessment system is further developed, review of items for assessments should
follow a structured protocol and should be reviewed by content and grade-level experts. Such
review should be sensitive to the interaction between cognition, observation, and interpretation.
Item review might include the following elements:
accuracy and grade-level appropriateness
mapping of the items to performance indicators
accompanying exemplar responses (for constructed-response items)
appropriateness of the correct response and distracters
conciseness, preciseness, clarity, and readability
existence of ethnic, gender, regional, or other possible bias. (NYSED, 2007a, p.16)
Considerations for an AA-MAS
Page 175
Such procedures are imperative, particularly for alternate assessments based on
alternate and modified achievement standards, so that students have access to the range of
academic content specified in the state‘s academic grade-level curriculum. Procedures minimize
clustering of assessment items in isolated content strands and further guarantee that
assessments are not over reliant on items that align to processes such as procedures in
mathematics and decoding skills in reading.
It is essential that reviews consider the interpretation aspect of the assessment model.
Messick (1989) puts forth the idea of a unified view of validity which takes into consideration the
ethical underpinnings of the test interpretation and use. He argues that the way content validity
has been defined, as ―evidence in support of the domain relevance and the representativeness
of content of the test instrument‖ (p.7) does not consider the inferences that may be made from
the test. Messick argues that ―we must inquire whether the potential and actual social
consequences of test interpretation and use are not only supportive of the intended testing
purposes, but at the same time are consistent with other social values‖ (p.8). Only through
systematic and comprehensive analysis of the assessment program will all of the issues related
to the assessment model (observation, cognition, and interpretation) be an integral part of the
test design process.
Judging the Alignment between Expectations and Modified Assessments
Webb (1997) offers five categories for judging the alignment between expectations and
assessments. The first, content focus, states that the focus should consistently be on
developing students knowledge of content. This consistency is primary emphasized in four
components: categorical concurrence, depth of knowledge consistency, range of knowledge
correspondence, and balance of representation.
1. Categorical concurrence allows for differences in the level of detail but expects the same
categories of content (such as content headings and their subheadings) to appear in the
Considerations for an AA-MAS
Page 176
expectations and the assessment. For example, an assessment in mathematics would
need to reflect the five content strands from the NCTM.
2. Depth of knowledge consistency can vary on a number of dimensions, such as level of
cognitive complexity, and describes how well students should be able to transfer the
knowledge to different contexts and how much prerequisite knowledge is necessary in
order to understand more difficult concepts. For example, the New York grade 5 core
mathematics curriculum, for standard indicator 5.A.7, states that the student will ―Create
and explain patterns and algebraic relationships (e,g.,2,4,6,8...) algebraically: 2n
(doubling)‖ (New York State Education Department, 2006). If students are only required
to identify the next item in the pattern, the depth of knowledge is not aligned for this
performance indicator. The learning map related to figurative language presented earlier
in this chapter provides another example of how the depth of knowledge can vary across
content achievement standards.
3. Range of knowledge correspondence refers to the degree to which expectations and
assessments cover comparable topics and ideas within categories. For example, the
New York (New York State Education Department, 2007b) grade 4 performance
indicators for ELA include the following: ―Standard 2: Students will read, write, listen, and
speak for literary response and expression: Make predictions, draw conclusions, and
make inferences about events and characters.‖ Assessments that only focus on
prediction would not meet the range of knowledge correspondence, since the remainder
of the skills are left out. While it is tempting to create learning maps related to
achievement standards that are additive (i.e., if student can make predictions, they are in
the developing level, if they can make predictions AND draw conclusions, they are at the
target level, etc.), this methodology does not adhere to the intent of the content
standard, which requires knowledge of all of the skills mentioned.
Considerations for an AA-MAS
Page 177
4. Balance of representation means that similar emphasis is given to different content
topics, instructional activities, and tasks. Assessments must reflect shifts in emphasis in
content. The following visual from the National Council of Teachers of Mathematics
(2000) emphasizes this shifting emphasis for the five content strands. Typically alternate
assessments have focused on number and measurement even for students at the
middle and secondary levels. Such emphasis would not be consistent with the shifting
emphasis in content. Similarly, in language arts, the example stated previously relating
―learning to read‖ and ―reading to learn‖ document this shifting emphasis across grade
levels.
Figure 5-1. Shift in Emphasis in Content Pre-K through Grade 12
The second category for determining alignment is articulation across grades and ages.
Expectations and assessments should reflect views about how students develop and learn at
different stages. This view includes ‗cognitive soundness as determined by best research and
understanding‘ and ‗cumulative growth in knowledge during students‘ schooling‘. The underlying
structure of knowledge in content domains influences how instructional experiences for students
should be organized. Specialty professional associations such as the National Council of
Teachers of Mathematics, the National Council of Teachers of English, and the International
Reading Association exist as organizations to support the articulation of this body of research
and understanding.
The third component is equity and fairness. Assessments that align to this criterion
provide every student with a reasonable opportunity to demonstrate their level of attainment
Considerations for an AA-MAS
Page 178
relative to what is expected. High expectations are reflected in all learning standards. Multiple
forms of assessment provide a better alignment based on students‘ level of knowledge, culture,
social background, and experiences.
The fourth category is pedagogical implications. Classroom practice is related to the
learning of students. Review of assessments must consider the implications for classroom
practice. Teachers might be asked to interpret expectations and assessments and consider how
their classroom practices fit with their interpretations. Two critical elements to consider when
taking pedagogical influences into account are the active engagement of students in learning
and effective classroom practice including the use of technology, materials, and tools.
Assessment is not a stand-alone component of educational practice. Curriculum, instruction,
and assessment should be linked in a coherent and meaningful fashion. Further, assessment is
an ongoing process that should inform instruction; therefore, effective practices align all three
components so that student learning is promoted as a coherent whole.
The fifth category for determining alignment of expectations and assessments is system
applicability. Programs must be realistic and manageable. Policy must be constructed so that it
is applicable to teachers and administrators in their day-to-day efforts and not present an
additional burden outside of what is considered ―normal‖ school activities.
Considering Cognitive Complexity in Assessments
The Council of Chief State School Officers recognizes three models for evaluating the
alignment between curricular expectations and assessments: Webb‘s alignment model, the
Surveys of Enacted Curriculum model, and the Achieve model (Roach, Niebling, & Kurz, 2008).
The Webb alignment model is the primary model that has been applied to alternate
assessments and will be the focus of this discussion (see also Chapter 9 by Abedi); however,
Surveys of Enacted Curriculum provides an additional framework for considering cognitive
complexity and is also described in the following paragraphs.
Considerations for an AA-MAS
Page 179
The Surveys of Enacted Curriculum [SEC] includes a common language framework for
examining the content and visual displays of alignment analysis (see Porter & Smithson, 2001).
The common language framework provides general categories under which a series of topics is
organized. For example, addition and subtraction of whole numbers would be under the larger
category of ―Operations‖. Other topical content categories for K-12 include number
sense/properties/relationships, measurement, consumer applications, basic algebra, advanced
algebra, geometric concepts, advanced geometry, data displays, statistics, probability, analysis,
trigonometry, special topics, functions, and instructional technology. Content areas for reading
and language arts for K-12 include phonemic awareness, phonics, vocabulary, awareness of
text and print features, fluency, comprehension, critical reading, author‘s craft, writing
processes, writing components, writing applications, language study, listening and viewing, and
speaking and presenting. All content areas will not be present at every grade level. Comparing
the content categories with levels of cognitive demand (see tables below) allow for a coarsegrain view of what students are expected to do with their knowledge of content. A fine-grained
view breaks the content into more discrete descriptions. Algebra, for example, includes such
components as absolute value, multi-step equations, factoring, etc.
Table 5-4. Surveys of Enacted Curriculum Cognitive Demand Categories for Mathematics
Memorize
Recite basic
mathematical
facts
Recall
mathematics
terms and
definitions
Recall formulas
and computational
procedures
Perform Procedures
Use numbers to count,
order, denote
Do computational
procedures or
algorithms
Follow procedures /
Instructions
Solve equations/
formulas/routine word
problems
Organize or display
data
Read or produce
Considerations for an AA-MAS
Demonstrate
Understanding
Communicate
mathematical ideas
Use
representations to
model
mathematical ideas
Explain findings
and results
from data analysis
strategies
Develop/explain
relationships
between
concepts
Conjecture,
Generalize, Prove
Determine the truth
of a mathematical
pattern or
proposition
Write formal or
informal proofs
Solve Non-routine
Problems, Make
Connections
Apply and adapt a
variety of
appropriate
strategies to
solve non-routine
problems
Recognize,
generate or
create patterns
Apply mathematics
in contexts outside
of mathematics
Find a mathematical
rule to generate a
pattern or number
sequence
Analyze data,
recognize patterns
Make and
Synthesize content
and ideas from
several sources
Page 180
graphs and tables
Execute geometric
constructions
Show or explain
relationships
between models,
diagrams, and/or
other
representations
investigate
mathematical
conjectures
Identify faulty
arguments or
misrepresentations
of data
Reason inductively
or deductively
Table 5-5. Surveys of Enacted Curriculum Cognitive Demand Categories for ELA/Reading
Memorize/Recall
Reproduce sounds
or words
Perform
Procedures/
Explain
Follow instructions
Give examples
Provide facts, terms,
definitions,
conventions
Locate literal
answers in text
Identify relevant
information
Check
consistency
Generate/
Create/
Demonstrate
Create / develop
connections
among text,
self, world
Recognize
relationships
Summarize
Identify purpose,
main ideas,
organizational
patterns
Analyze/
Investigate
Evaluate
Categorize /
schematize
information
Determine
relevance,
coherence, internal
consistency, logic
Distinguish fact and
opinion
Dramatize
Compare and
contrast
Order, group,
outline,
organize ideas
Identify with
another's point
of view
Express new
ideas (or
express ideas
newly)
Make inferences,
draw conclusions
Describe
Gather
information
Assess adequacy,
appropriateness,
credibility
Test conclusions,
Hypotheses
Synthesize content
and ideas from
several sources
Generalize
Predict probable
consequences
Critique
Develop
reasonable
alternatives
Integrate with
other topics
and subjects
SEC involves raters, including individual teachers and an alignment panel of three or
more content area specialists. Teachers complete surveys at the end of the year, rating level of
coverage for topics and subtopics and the level of cognitive demand for tasks in each of the
topic areas. The model provides useful descriptors of cognitive demand that can serve as a
guide in considering the design of assessments.
Application of Webb‘s model requires members of a trained alignment panel, consisting
of educators and curriculum experts, to:
Considerations for an AA-MAS
Page 181
1. Recognize and apply depth-of-knowledge (DOK) level rating for each objective in the
state content standards.
2. Rate the DOK level for each assessment task.
3. Identify the objective(s) from the content standards to which the assessment item
corresponds.
The central feature of this model is the Depth of Knowledge rating given to each
assessment item. There are four depth-of knowledge levels: recall, skill/concept, strategic
thinking, extended thinking. Once this task is completed, analysis of the ratings allow for
computing descriptive statistics for each of the four criteria in Webb‘s alignment model:
categorical concurrence, range of knowledge, balance of representation, and depth of
knowledge which were described earlier in this chapter.
Table 5-6. Webb‘s General Descriptions for Depth-of-Knowledge Level
Level
Level 1: Recall
Description
Recalling information such as facts, definitions,
terms, or simple procedures; performing simple
algorithms or applying formulas
Level 2: Skill/Concept
Requires some decision as to how to approach a
problem or activity; classifying, organizing,
estimating, making observations, collecting and
displaying data, comparing data
Level 3: Strategic Thinking
Requires reasoning, planning, using evidence, and a
higher level of thinking than recall or skill/concept;
Explaining one‘s thinking, making conjectures,
determining solutions to a problem with multiple
correct outcomes
Level 4: Extended Thinking
Requires complex reasoning, planning, developing,
and thinking often over an extended period of time.
Cognitive demand for tasks is high and work is
complex. Requires making connections within and
between subject domains. Includes designing and
conducting experiments, making connections
between a finding or outcome and related concepts,
combining and synthesizing ideas into new
concepts, critiquing literary pieces and designs of
experiments.
These models provide useful descriptors for developing modified achievement standards
and alternative assessments based on those standards. The descriptors can also guide
Considerations for an AA-MAS
Page 182
assessment development and ensure the assessments cover the same breadth and depth of
content.
Levels of Cognitive Complexity in Mathematics
State assessment frameworks articulate performance indicators listed for content
strands and are intended to provide teachers with guidance in determining the outcomes of
instruction. The following discussion illustrates how state standards can be addressed through
items with different depth of knowledge levels with the items still directly related to the content
standard.
The New York Mathematics, Science, and Technology: Standard 3 states ―Students will:
understand the concepts of and become proficient with the skills of mathematics;
communicate and reason mathematically;
become problem solvers by using appropriate tools and strategies;
through the integrated study of number sense and operations, algebra, geometry,
measurement, and statistics and probability‖ (NYSED, 2005).
The NYSTP Mathematics Tests is designed to assess students on the content and
process strands of this standard. Items are aligned to one content-performance indicators, but is
also aligned to one or more process-performance indicators as appropriate for the concepts that
are embodied in the task (NYSED, 2007a) though the procedure used for determining alignment
to the process performance indicators is not described in the technical documents.
An Illustration Based on Standard 3 for Grade 5 Mathematics. Consider an example
based on the NY Standard 3 for grade 5 mathematics. The Geometry Strand includes the goal
that ―Students will apply coordinate geometry to analyze problem solving situations‖ and more
specifically to ―plot points to form basic geometric shapes (identify and classify)‖ which is
indicator 5.G13 (NYSED, 2005).
Considerations for an AA-MAS
Page 183
The guiding principle is that assessment tasks must be aligned to content for the gradelevel standards. This indicator requires that students demonstrate skills and understanding of
coordinate geometry by plotting points in the context of identifying and classifying basic
geometric shapes. The indicator contains multiple content-related targets suggesting that a
modified standard could be constructed by breaking relevant tasks into multiple components.
Depth of knowledge can be considered by changing the complexity of assessment tasks related
to the indicator. How specific assessment items contribute to a student‘s level of proficiency is
discussed later in this chapter.
The following illustration would include the appropriate information displayed in a
coordinate grid such as the one that follows:
Depth of knowledge illustrations are presented for each of the four levels offered by Webb.
1. Recall and Reproduction. Present student with the following points graphed on a
coordinate plane: A (1, 5), B (3, 2), C (6, 2), and D (?, ?). Students are asked to give the
coordinates for point D. Next, present a similar diagram with points A (1, 1), B (1, 5), C (5, 5),
and D (5, 1) connected to form a quadrilateral. The student is required to identify the type of
quadrilateral formed by connecting the points. Students might be given possible choices such
as square, rectangle, and trapezoid.
2. Skills and Concepts/Basic Reasoning. Presented with a coordinate grid, students are
asked to plot points A (1, 1), B (1, 5), C (5, 5), and D (5, 1). They are then asked to connect A,
B, C, and D in order and to identify the quadrilateral which is formed. Students might be asked
Considerations for an AA-MAS
Page 184
to explain or describe how they determined the type of quadrilateral that was formed. The
assessment item might include scaffold to guide the students in this process such as ―Describe
the characteristics of the sides and angles that helped you decide what type of figure was
formed‖.
3. Strategic Thinking/Complex Reasoning. The student is asked to plot points A (1, 1), B
(1, 5), and C (5, 5). They are then asked to plot point D such that the figure formed by
connecting the points A, B, C, and D, in order, forms a rectangle. Name the coordinates for
point D. Give two reasons why the figure has to be a rectangle.
4. Extended Thinking/Reasoning. The student is asked to plot point A (1, 1). Students
are then asked to plot three additional points and connect them such that the figure formed is a
rectangle. To extend their thinking, the student is asked to describe a process for forming a
rectangle in a coordinate grid given one point as a vertex. Instead of a rectangle, the student
might be asked to discuss the process for constructing a trapezoid given one point as a vertex.
Item Modifications
Additional modifications can be accomplished by changing the format of the assessment
items, reducing the complexity of the language used in the item, and providing additional
information or scaffolding to reduce the cognitive load for the student; however, the items must
maintain alignment to the grade-level content in the standard, as discussed earlier. Hess,
McDivitt, and Fincher (2008) conducted a pilot study about the effects of providing scaffolds for
students within test items and across state assessments, to see if these scaffolds allowed
students to better document knowledge that they possessed related to content standards. Some
of the scaffolds that they studied included restricting the use of pronouns, using graphic
organizers, chunking or segmenting longer texts into shorter pieces, left justifying text,
shortening or simplifying test item stems, adding graphics to illustrate a term, and paying
attention to the physical presentation of the assessment material by examining typeface,
Considerations for an AA-MAS
Page 185
spacing on the page, line length, and the use of blank space (or leading) around paragraphs or
between columns of numbers to make them more legible. A goal of these modifications was to
allow students access to the information on the assessment without cuing them to correct
answers. These methods, if effective, could meet the AA-MAS guidelines of ―less difficult‖ but
adhering to the fidelity of the grade-level standard. In this study, teachers were also asked to
use the scaffolding supports in their lessons, so that students were used to seeing them before
the actual assessment. The results of this pilot study indicate that providing scaffolding that
supports both teaching and assessments could provide a valid way to assess students on the
AA-MAS test. While the scaffolds discussed are research based, there are some inconsistent
findings about their effectiveness in improving student performance. For instance, Abedi et al.
(2008) found that students with disabilities did not perform significantly better on reading
comprehension assessments that utilized segmented text, although the reliability of the tests
improved. Further studies and appropriate field testing are necessary to justify the choice and
use of proper scaffolds at the state level.
Some more specific examples of scaffolding are shown in the following:
Common Stimulus. The approach of a common stimulus has been used on the National
Assessment of Educational Progress (NAEP; Kenney, 2000). The common stimulus might be a
table, graph, or chart. A series of items draws from previously presented information or a
common context. Anderson and Morgan (2008) offer some guidelines for constructing items that
have a common stimulus:
Items should be independent. A student‘s response on one question should not be
dependent on getting the correct answer for a previous item.
Items should refer to a clearly different aspect of the stimulus to avoid overlap.
Items should assess a range of skills.
Items should have a range of difficulty with the easier items appearing first.
Considerations for an AA-MAS
Page 186
Information given in a stem or answer choice should not assist the student in correctly
answering another item.
Items should appear on the same page or on a facing page.
Such simple approaches reduce the cognitive load required to comprehend and process
multiple items presented with varying contexts or information.
Replace text with relevant pictures, diagrams, tables, graphics. In the following diagram,
the student is presented with a diagram that may assist them in visualizing the relationship
between the sides of the scale and representing that relationship symbolically. This NAEP item
received mostly exemplary comments from raters because of the scaffolding provided by the
visual; however, one rater argued that the item was inauthentic and imposed as a testing
convention since the item could be solved by visual inspection and did not require the
construction of a number sentence (Daro, Stancavage, Ortega, DeStefano, & Linn, R., 2007).
Despite this criticism, the item demonstrates how visuals might assist a student in
understanding the relationship and thus able to focus on the task of identifying an equivalent
symbolic representation. The diagram did not change the underlying skill of being able to
identify a relationship symbolically.
Reduce complexity of stem. The following question and its modification (Elliott, Kurz,
Beddow & Frey, 2009) illustrates how items might be modified to reduce the complexity of the
Considerations for an AA-MAS
Page 187
item stem while preserving the alignment to the content standard. The modified item requires
the student to evaluate a formula to find the area of a complex figure consisting of a rectangle
and a triangle. The revised format removes the requirement that the student either recalls the
appropriate formula or identifies it from the list provided in the booklet. Both items assess a
student‘s ability to evaluate a geometric formula using data from a figure. Both items also
require the student to differentiate between a rectangle and a triangle and to understand the
basic concept of area. The modified item removes extraneous information, i.e. adjacent, while
also maintaining depth of knowledge.
Considerations for an AA-MAS
Page 188
Hess, McDivitt, and Fincher (2008) provide a similar example of simplifying the stem and
making the distractors complete sentences, in the following example.
Original Stem
The United States eventually reduced the number of
immigrants allowed to enter the country because:
A.
B.
C.
D.
the United States already had too many people.
the immigrants were taking away jobs from
American workers.
the immigrants had too many hardships to face in
America.
the country that the immigrants came from was
angry about their leaving.
Modified Stem and Distractors
Why did the United States reduce the number of
immigrants?
A.
B.
The United States already had too many people.
The immigrants were taking away jobs from
American workers.
C. The immigrants had too many hardships to face in
America.
D. The country that the immigrants came from was
angry about their leaving.
Grouping questions by content or topic. Clustering items according to the particular
standard that they link back to or by specific learning targets or objects is another way of
modifying the assessment without changing the content focus of the item. For example if an
assessment in mathematics has four items that link to the state standard ―Apply ratios and
proportions in solving real-world problems‖ then clustering those items (preferably by level of
difficulty with easier items first) will facilitate the student being focused on specific content for a
longer period of time without having to adjust to changes in content from item to item. The
student will be working with similar thinking processes as proportional reasoning is applied to
situations such as rates of change, percentages, unit pricing or rates, etc. Such measures
reduce anxiety and may generate more interest thus improving concentration.
Asking understanding questions. Another set of modifications involves breaking complex
tasks into components which may contain hints or supports to assist the test taker. A longer task
may be broken into parts that are matched to a specific indicator or expectation (Suurtamm,
Lawson, & Koch, 2008). Suurtamm and her colleagues do warn, however, that such
modifications potentially lead to specific approaches to solving a problem and may diminish
students‘ opportunities to participate in complex problem solving. In modified assessments such
practices are likely to reduce the overall complexity of the problem-solving situation while
retaining a link to the content standard.
Considerations for an AA-MAS
Page 189
Another aspect of this type of modification is providing hints. For example if a student is
asked to define a compound word (―noninterference,‖ for example), they might be prompted to
―Break the word into parts.‖ Hess, McDivitt, and Fincher (2008) show how thought balloons,
similar to what you might see in a comic strip, might be used to provide these hints.
These illustrations demonstrate that item modifications can be made while preserving
the fidelity of the content. Such modifications reduce cognitive load and simplify language
features that sometimes obscure the intent of the assessment item. Simplifying language
features is important in making assessment items accessible to a larger population of students
including those with learning difficulties and those for whom English is not their native language
(also see the chapter by Abedi). Scaffolding and related practices are good instructional tools
and should not only be used during assessments. Remember that there should be a coherent
link between the curriculum, instruction, and assessment.
How Do We Link Content to Curriculum and Instruction Appropriate for this Population?
Curriculum access, data collection, and instructional effectiveness are issues that have
been identified as variables that potentially influence student outcomes (Spooner, Dymond,
Smith, and Kennedy, 2006). As emphasized throughout this chapter, linking assessment to
content standards increases the likelihood that students with learning difficulties will have
access to relevant grade-level academic content. The importance of curriculum access has
been the focus of this chapter. Continued monitoring through data collection and analysis of
student performance will provide greater alignment between instruction and assessment
outcomes. Linking the curriculum and instruction to assessment outcomes is crucial in focusing
the instructional design system on planning, implementing, and assessment of student learning.
The teacher is a critical factor in linking curriculum and instruction. Browder, Karvonen,
Davis, Fallin, and Courtade-Little (2005) found that when teachers are trained on sound
instructional practices, students‘ scores on alternate assessments improved. Fuchs et al. (2008)
Considerations for an AA-MAS
Page 190
identified seven instructional principles that promote mathematical learning for students with
disabilities. These instructional principles also provide an important set of guidelines for
application to other content domains. First, instructional explicitness refers to instruction in
which the teacher provides explicit and didactic teaching sharing information focused on the
goals of instruction. The authors report that a meta-analysis of 58 math studies show that while
developing students advance from programs with constructivist and inductive styles that
students with mathematics difficulties do not profit in meaningful ways . Second, instructional
design to minimize the learning challenge anticipates and eliminates misunderstandings with
precise details, and the utilization of intentionally sequenced and integrated instructions focused
on addressing gaps in achievement. The use of learning tools such as manipulatives and
visuals enhance mathematics instruction, reducing confusion and the inability to maintain
content. Third, a strong conceptual basis situates the procedures being taught in order to
provide a strong conceptual foundation. Fourth, drill and practice is critical to maintaining skills
through daily lessons, review, and computerized supports. Fifth, cumulative review reinforces
practice and review, building a continued reliance on foundational skills being taught. Sixth,
instruction must include motivators to help students regulate their attention and behavior and to
work hard integrating systematic self-regulation and motivation supports including tangible
reinforcers. The seventh principle, considered the most essential, is ongoing progress
monitoring to establish whether a treatment is effective for a particular student.
In the next chapter, these ideas regarding modifying the content will be actualized into
test design theory. However, it is important to remember that the best test design will not
produce the desired results if the understandings about human cognition applied to the item
development are not also carried into the classroom through curriculum and instruction.
Considerations for an AA-MAS
Page 191
References
Abedi, J., Kao, J.C., Leon, S., Sullivan, L., Herman, J.L., Pope, R., Nambiar, V., &
Mastergeorge, A.M. (2008). Exploring factors that affect the accessibility of reading
comprehension assessments for students with disabilities: A study of segmented text.
Los Angeles: National Center for Research on Evaluation, Standards, and Student
Testing. [on-line] Available: http://www.cse.ucla.edu/products/reports/R746.pdf
Anderson, P., & Morgan, G. (2008). Developing tests and questionnaires for a national
assessment of educational achievement. Washington, DC: The World Bank.
Ban, P., Holt, L., & Kurizaki, V. (2008). Hawaii progress maps. Presentation made at the Council
of Chief State School Officers Conference, National Conference on Student Assessment
Resources, Orlando, FL.
Browder, D.M., Karvonen, M., Davis, S., Fallin, K., & Courtade-Little, G. (2005). The impact of
teacher training on state alternate assessment scores. Exceptional Children, 71, 267282.
Browder, D.M., Spooner, F., Wakeman, S., Trela, K., & Baker, J. (2006). Aligning instruction
with academic content standards: Finding the link. Research & Practice for Persons with
Severe Disabilities, 31(4), 309-321.
Department of Education (April, 2007). Modified academic achievement standards. Washington,
DC: Authors.
Daro, P., Stancavage, F., Ortega, M., DeStefano, L., & Linn, R. (2007). Validity study of the
NAEP Mathematics Assessment: Grades 4 and 8. Washington, DC: NAEP Validity
Studies Panel, U.S. Department of Education.
Elliott, S. N., Kurz, A., Beddow, P., & Frey, J. (2009). Cognitive load theory and universal design
principles: Applications to test item development. Paper presented at the annual meeting
of the National Association of School Psychologists, Boston, MA.
Fuchs, L.S., Fuchs, D., Powell, S.R., Seethaler, P.M., Cirino, P.T., & Fletcher, J.M. (2008).
Intensive intervention for students with mathematics disabilities: Seven principles of
effective practice. Learning Disability Quarterly, 31(2): 79–92.
Herber, H.L. (1970). Teaching reading in the content areas. Englewood Cliffs, NJ: Prentice-Hall.
Hess, K., McDivitt, P., & Fincher, M. (2008). Who are the 2% and how do we design test items
and assessments that provide greater access to them? Results from a pilot study with
Georgia students. [online] available at:
http://www.nciea.org/publications/CCSSO_KHPMMF08.pdf
International Reading Association and National Council of Teachers of English. (1996).
Standards for the English Language Arts. Newark, DE and Urbana, IL: Authors.Kenney,
P. A. (2000). Families of items in the NAEP mathematics assessment. In N. S. Raju, J.
W. Pelligrino, M. W. Bertenthal, K. J. Mitchell, & L. R. Jones (Eds.), Grading the nation’s
report card: Research from the evaluation of NAEP (pp. 5 - 43). Washington, DC:
National Academy Press.
Considerations for an AA-MAS
Page 192
Kenney, P.A. (2000). Market basket reporting for NAEP: A content perspective. Paper
presented at the March workshop of the Committee on NAEP Reporting Practices:
Investigating District-Level and Market-Based Reporting, National Research Council,
Washington, DC.
Ketterlin-Geller, L.R. (2008). Testing students with special needs: A model for understanding the
interaction between assessment and student characteristics in a universally designed
environment. Educational Measurement: Issues and Practice, 27(3), 3-16.
LaMarca, P. M. (2001). Alignment of standards and assessments as an accountability criterion.
Practical Assessment, Research & Evaluation, 7(21). Retrieved March 20, 2009 from
http://PAREonline.net/getvn.asp?v=7&n=21.
Mathematical Sciences Education Board. (1993). Measuring what counts: A conceptual guide
for mathematics assessment. Washington, D.C.: National Academy Press.
Messick, S. (1989). Meaning and values in test validation: The sciences and ethics of
assessment, Educational Researcher, 18(2), 5-11.
National Center for Educational Statistics. (2003). Mathematics Framework for the 2003
National Assessment of Educational Progress. Washington, DC: Author.
National Council of Teachers of Mathematics. (2006). Curriculum focal points for
prekindergarten through grade 8 mathematics: A quest for coherence. Reston, VA:
Author.
National Council of Teachers of Mathematics. (2000). Principles and standards for school
mathematics. Reston, VA: Author.
National Reading Panel. (2000). Report of the National Reading Panel, teaching children to
read: An evidence-based assessment of the scientific research literature on reading and
its implications for reading instruction. Washington, DC: National Institute for Literacy.
New York State Education Department. (2006). Mathematics Core Curriculum: MST Standard 3.
New York: The University of the State of New York & New York State Education
Department.
New York State Education Department. (2007a). New York State Testing Program 2007:
Mathematics, Grades 3-8. Technical Report. Monterey, CA: CTB/McGraw-Hill.
New York State Education Department. (2007b). New York State Testing Program 2007:
English Language Arts, Grades 3-8. Technical Report. Monterey, CA: CTB/McGraw-Hill.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The
science and design of educational assessment. Washington, DC: National Academy
Press.
Porter, A. C., & Smithson, J. L. (2001). Defining, developing, and using curriculum indicators
(CPRE Research Report Series RR-048). University of Pennsylvania: Consortium for
Policy Research in Education.
Considerations for an AA-MAS
Page 193
Roach, A. T., Niebling, B. C., & Kurz, A. (2008). Evaluating the alignment among curriculum,
instruction, and assessments: Implications and applications for research and practice.
Psychology in Schools, 45(2), 158-175.
Shanahan, T. (2003). Research-based reading instruction: Myths about the National Reading
Panel report. The Reading Teacher, 56, 646-655.
Spooner, F., Dymond, S. K., Smith, A., & Kennedy, C. H. (2006). What we know and need to
know about accessing the general curriculum for students with significant cognitive
disabilities. Research and Practice for Persons with Severe Disabilities, 31, 277-283.
Suurtamm, C., Lawson, A., & Koch, M. (2008).The challenge of maintaining the integrity of
reform mathematics in large-scale assessment. Studies in Educational Evaluation 34,
31–43.
Webb, N.L. (1997). Determining alignment of expectations and assessments in mathematics
and science education. NISE Briefs, 1(2), 1-11.
Webb, N. L. (1999). Alignment of Science and Mathematics standards and assessments in four
states. Washington, DC: Council of Chief State School Officers.
Considerations for an AA-MAS
Page 194
CHAPTER 6
DEVELOPING ITEMS AND ASSEMBLING TEST FORMS FOR THE ALTERNATE
ASSESSMENT BASED ON MODIFIED ACHIEVEMENT STANDARDS (AA-MAS)
Catherine Welch
Stephen Dunbar
In many respects, the development of items and assembly of test forms for specific
populations involves no special process considerations other than those required of any
professional test development activity. This chapter will begin with an overview of best practices
in item and test development in K-12 achievement testing in the context of the content domains
of English Language Arts (ELA) and mathematics. Although the processes involved may be
similar, the specific accountability context established by the AA-MAS guidelines, the potentially
diverse characteristics of students in the AA-MAS population, and the fact that states are
approaching AA-MAS designs in the presence of existing accountability tests developed under
federal guidelines for technical quality means that certain steps in the test development process
may deviate from typical best practice.
Nothing in the federal guidelines for the AA-MAS program specifies that the design and
development of the two-percent assessment be approached as a modification of an existing
general assessment or as an alternate assessment developed as a separate endeavor (U.S.
Department of Education, 2007). Given the position of the AA-MAS assessment in a difficult to
define gray zone between two existing assessments in each state, one can imagine its design
and development to follow an approach already established by a state (or by its contractors) for
any existing assessment, including an alternate assessment based on grade-level achievement
standards (e.g. Massachusetts). Alternatively, an AA-MAS assessment might be developed as a
kind of hybrid, consisting of features and materials from the general assessment and, where
required by considerations of accessibility for example, measurement approaches adopted in
the state‘s AA-AAS program. This chapter will discuss test development processes in general
Considerations for an AA-MAS
Page 195
that apply to whatever approach is taken by a state. Professional standards for test
development and for the assessment of students with disabilities (American Educational
Research Association, American Psychological Association, & National Council on
Measurement in Education, 1999) provide guidance regardless of the approach. Much of the
discussion in this chapter, however, presumes the AA-MAS approach taken by most states is to
modify the existing general assessment. In most settings, the general assessment (its essential
measurement features, its alignment to state content standards, its methods of scaling and
reporting student achievement and mapping onto achievement levels) establishes the technical
standards to be evaluated for purposes of reliability, validity, and comparability in the federal
peer review process. Modification of the general assessment is likely to be cost effective as
well. Thus, the argument to support inferences from the AA-MAS for purposes of accountability
is likely to be structured with a state‘s general assessment in mind. This provides states with a
logical starting point for developing the many justifications for resource allocation that set the
foundation for validity and comparability arguments (Kane, 2006; Marion, 2009; Abedi, Chapter
8, this volume) as articulated in federal guidelines.
This chapter begins with a discussion of best practice in test development. The purpose
of this discussion is to clarify key processes in development that contribute to the technical
qualities of any assessment. Specific aspects of the process prominent in K-12 achievement
testing for NCLB are noted as they represent key procedural steps that may be altered for the
AA-MAS. Special considerations for the AA-MAS context then are discussed in order to clarify
the implications of modified achievement standards and performance level descriptors (cf.
Perie, 2008) for item and test development. In this discussion, the advantages and
disadvantages of various options for modification are highlighted. Because a very real aspect of
development for the AA-MAS is modification of items from the general assessment, examples of
item analysis results from the latter are presented to illustrate approaches to identifying items for
modification. Finally, psychometric consequences of test modifications are discussed, as they
Considerations for an AA-MAS
Page 196
play out in the assembly of test forms and in the analysis of technical characteristics of items
and test forms. This chapter closes with consideration how best to document modifications
during test development so the case can be made for validity and comparability as well as for
interpretation of test results for both reporting on the achievement of individual students and the
use of results for AYP purposes.
Best Practice in Item Development and Forms Assembly
Test development plays a key role in validation, and validity considerations play a key
role in test development. The procedures to develop and revise test materials and interpretive
information lay the foundation for test validity. Meaningful evidence related to inferences based
on test scores can only provide scores with utility if test development produces meaningful test
materials. Content quality is the essence of arguments for test validity (Linn, Baker and Dunbar,
1991). Test development is undeniably important to the proper interpretation of test scores and
the inferences that are drawn from them (Kane, 2006). Users of test scores should study the
specifications for the test, how they were derived, and the process by which the test is
developed. Test development influences many aspects of validity (most importantly content
validity) and many types of inferences. The purpose of this section is to discuss the issues and
considerations associated with best practice in developing tests. The considerations that are
provided in this chapter are consistent with the Standards for Educational and Psychological
Testing (AERA, APA & NCME, 1999). The Standards constitute a seminal guide for proper test
design and development.
Key to proper test design and development is the incorporation of universal design
principles for all types of test items. These principles suggest an approach to assessment
development based on principles of accessibility for a wide variety of end users. These
principles are as applicable to the general assessment population as they are to a student
population preparing to take an AA-MAS. Thompson, Johnstone, and Thurlow (2002) described
Considerations for an AA-MAS
Page 197
seven elements of universally designed assessments including inclusive test population;
precisely defined constructs; accessible, non-biased items; tests that are amenable to
accommodations; simple, clear and intuitive procedures; maximum readability and
comprehensibility; and maximum legibility. Further guidance for states on universally designed
assessments is provided by Lazarus, Thurlow, Christensen and Cosmier (2007).
Test Domain
A critical stage in the development of a statewide assessment is the design stage. In the
design stage, important overall decisions about the test are made including: establishing a
validation foundation for the test based on the state‘s academic content and achievement
standards; designing test specifications that align with those standards; and reviewing, refining,
and reaffirming validity evidence for test design.
Before test design can take place, it is important that the test developer understand the
link between test purpose and test domain. The NCLB Act requires all public school students to
participate in statewide assessments. The primary purposes of these assessments (given
annually in certain grades and subjects) is to measure student progress towards state
achievement standards and to hold schools, districts, and states accountable to bring all
students to a proficient level in reading and mathematics. Test domain is used to refer to the
various attributes used to define what a test should measure, including content topics, tasks,
and process levels. In the NCLB context, states‘ content and achievement standards provide
this definition of test domain. Understanding this connection between purpose and domain
allows developers to determine what should and what should not be included on a statewide
assessment. To be able to develop items that measure the test domain, developers need to
define the test domain explicitly (i.e., which of the state‘s content standards are eligible for
inclusion on the statewide assessment and which are not).
Considerations for an AA-MAS
Page 198
One of the major struggles with current statewide assessments is the large number of
standards that most of the states have adopted and the need to align content standards with
curriculum and statewide assessments. There are many issues in alignment, stemming from the
wide variation in the specificity and clarity of state standards in defining what students need to
know and be able to do, an imbalance between the number of standards and the testing time
available, and the lack of agreement about the relative importance of the standards and the
emphasis each receives in the statewide assessment. Several methods for evaluating alignment
have been developed in recent years (Webb, 1999; Rothman, Slattery, Vranek, Resnick, 2002;
Bhola, Impara, & Buckendahl, 2003). Using empirical validation strategies that focus on
alignment can help identify, through more objective means, those standards that are the most
important priorities for inclusion on the assessment.
Another important consideration at the early stages of test design, and particularly
important for the AA-MAS, is defining the examinee population(s) for whom the test is intended
(Quenemoen, Chapter 2, this volume). It is important to define the characteristics of the
students who will constitute the examinee population for the test. Specifying the examinee
population must take into account examinee characteristics that fall outside of the requirements
of the test, but may constrain or confound the examinee‘s performance on the test.
Test Specifications
The next step in test design is to specify the important attributes of the items and test
forms. Test specifications are often called blueprints because they specify how the test is to be
constructed. Derived directly from test philosophy, test purpose and use, test audience, and
empirical validity evidence gathered for the test, test specifications delineate the requirements
for the subsequent stages of development, review, field testing, assembly, and evaluation of the
end product. The test specifications should identify the content domain, cognitive processes, the
Considerations for an AA-MAS
Page 199
number and balance of items, the technical characteristics of the test, and the appropriate item
formats (AERA, APA & NCME, 1999, Standard 3.3). Each of these is presented below.
Content domain. The empirical methods described above can be used to define the
content topics to be included in the domain measured by a test. The ideal level of specificity of
content topics in test specifications is that which ensures adequate control of all crucial
elements of the content standards. In defining the content domain, the test developer must
understand the structure of the content domain, how topics within the domain relate to each
other, and how students build their knowledge over time.
Processes. It is equally important to specify the process requirements of the items in the
test. These cognitive requirements represent the breadth, depth and range of complexity that
students have been taught to use within the context of the content domain. There are multiple
approaches to the classification of items by cognitive demand (see Pugalee & Rickelman,
Chapter 5, this volume, for more detail).
Distribution of content and processes. Each content area and process needs to have a
weight assigned to it in the test specifications that represents the relative emphasis to be placed
on the topic or skill in a test form or item pool. These weights are important to the assembly of
multiple, parallel test forms, and they are especially relevant in establishing comparability of an
AA-MAS and the general assessment.
Item formats. Following the specification of test content and processes, the next aspect
of the test specifications involves identifying the type(s) of items to be developed. At this point,
the test developer needs to define the item format features that are required by the content and
process specifications, specify the item types that possess those features, and comparatively
evaluate each item type to identify those that might be preferred for reasons of coverage,
economy, precision, response time, development and scoring costs, delivery constraints, and
feasibility. The content and skills to be measured should drive the choice of item type. Selectedresponse (aka multiple-choice) items may be significantly more efficient in the amount of
Considerations for an AA-MAS
Page 200
information they gather per unit of testing time, but constructed-response items can add a
performance dimension to observed scores and can be scored reliably, although not as
inexpensively as selected-response items (Welch, 2006). Lindquist (1951) argued that the item
type should match the criterion of interest. He indicated that the test developer should make the
item format as similar to the criterion format as possible, recognizing the constraints of
efficiency, comparability, and economy. As noted by Pellegrino (Chapter 4, this volume), there is
no necessary connection between response format and cognitive level (e.g. multiple choice
items can be used to assess higher-order thinking, and some performance tasks only measure
surface knowledge).
Test Length. The next substantive consideration under the category of test specifications
is length. The optimum test length is one that is accurate enough to support the inferences that
will be made on the basis of the test results. Test length is a function of many concerns, most of
which have been described including content coverage and item formats. However, in addition
to such concerns, test length is also a practical constraint of testing time. Testing time may be
influenced by constraints such as the administrative time periods such as class periods or the
age of the student examinees
Technical Characteristics. The test developer must consider the specifications for the
test(s) as a whole. This includes consideration of statistical specifications such as estimates of
reliability, distribution of content and processes across the test form, test organization,
administrative plan, and special accommodations.
To the extent the test specifications are well specified, the test forms produced will be far
more parallel than they would be if developed from general specifications. By developing
detailed specifications, the test developer considers many specifics of the test-development
process before that process is begun, thereby resolving many of the issues that will arise during
development. By carefully considering the major aspects of the testing process, the test
developer can identify inconsistent or conflicting specifications early in the design process.
Considerations for an AA-MAS
Page 201
Well-developed test specifications drive the entire item-development and test-assembly
process and serve as helpful directions to item writers, reviewers, and test users. Kane (2006)
notes the importance of linking test specifications, item development, and forms assembly with
the interpretive argument that will be used in attaching meaning to test scores. In this sense,
validity is grounded in the test development process itself.
The content-validation process is also ongoing. Evidence supporting the test
specifications should be reaffirmed on a regular basis. If state assessments are to reflect the
curriculum and the expectations of teachers as to what their students need to be ready to learn
or what they should have learned, test developers need to engage in a regular process to
collect evidence to adjust or reaffirm the test specifications. Test design is at best an iterative
process, one that repeatedly cycles through information gleaned through item development, test
administration, and evaluation.
Item Development
Sound test development depends on well-defined, defensible item development. Sound
item development is critical for providing the quality and consistency necessary to produce
reliable test scores upon which validated test-score inferences can be made.
The development process should include considerations of universally designed
assessments. Thompson, Johnstone & Thurlow (2002) identify specific questions for test
developers to take into account as they develop items and design assessments. The
considerations of universal design appropriate for all stages of test development include:
1. Incorporate elements of universal design in the early stages of test development.
2. Include disability, technology, and language acquisition experts in item reviews.
3. Provide professional development for item developers and reviewers on use of the
considerations for universal design.
4. Present the items being reviewed in the format in which they will appear on the test.
Considerations for an AA-MAS
Page 202
5. Include standards being tested with the items being reviewed.
6. Try out items with students.
7. Field test items in accommodated formats.
8. Review computer-based items on computers.
Item Writing. Item writing is very much an iterative process, but it can be undertaken in a
standardized manner. Item-development processes need to establish principles and procedures
that take into account the various audiences and purposes of the program. Item-development
processes for constructed-response items may also include the initial drafting of the scoring
rubric simultaneously with the item writing. The qualifications of the item writers, the security of
the process, and the training are all essential considerations for the item-development process.
The process adopted for developing items in any testing program is critical and must be
considered in relation to issues of validity, reliability, and interpretability. The determination of
the source of the item content depends upon test purpose and the inferences that need to be
made based upon that content. Identifying those individuals who are qualified to develop items
will be dependent upon the requirements of a particular assessment. Common procedures
reflect a concern for demographic characteristics such as representation of the racial/ethnic
backgrounds and gender of the examinee population.
Item Reviews. Once items have been developed, they should be subjected to a
multistage, multipurpose review for content accuracy, fairness, universal design, and
psychometric concerns. As with item writers, it is critical that item reviewers be experts in the
area for which they are being recruited, that they be representative of the examinee population,
and that they receive standardized training on the item attributes they are being recruited to
evaluate.
The content reviewers should then be asked to review the items according to a set of
established criteria. These criteria include scrutinizing items to ensure that they:
1. Align with the specified content standards,
Considerations for an AA-MAS
Page 203
2. Match to the specified processes,
3. Are technically correct,
4. Include effective distractors for multiple-choice items
5. Include draft scoring rubrics for constructed-response items,
6. Show clarity in response options (keyed option correct, distractors incorrect), and
7. Adhere to the specified item format.
Reviewers should also provide guidance on how to rephrase item stems, propose
alternative keys and distractors, clarify scoring criteria, and identify ambiguous or confusing
language in order to improve item quality. This guidance could be informed by cognitive
interviews, think-alouds, and piloting the items in individual administrations to examine their
cognitive demands.
All item development should be attended to fairness both in principle and in practice.
Both the Code of Fair Testing Practices in Education (2004) and the AERA/APA/NCME
Standards include obligations for ensuring fairness to test takers. The Standards also address
obligations to ensure fairness through all stages of test development, test administration, and
test use.
Assessments should also be reviewed for consistency with universal design principles to
help ensure that optimal, standardized conditions are available for all students and that the test
materials students encounter do not present unnecessary complexity in surface appearance.
Although content reviews are critical in consideration of internal qualities of a test,
fairness reviews are equally essential in large-scale assessment programs as they are designed
to ensure that all test takers have a comparable opportunity to demonstrate what they know and
can do. Test fairness starts with design of the test and its specifications. It then continues
through every stage of the test-development process, including item writing and review, item
field testing, item selection and forms construction, and forms review. These reviews help to
Considerations for an AA-MAS
Page 204
ensure that items are evaluated from diverse viewpoints, not least of which are based on
multicultural and gender-related perspectives (Camilli, 2006; Schmeiser & Welch, 2006).
Field Testing. Once the items have been reviewed and problems with them addressed,
the items are typically prepared for field testing. Following the field test, item evaluations should
be conducted using the field test data. Statistical analyses of field-test data typically include item
analysis that is used to identify items that may be problematic. For constructed-response items,
analyses may include: (a) descriptive statistics, such as the mean performance, standard
deviation of the mean performance, range of responses, and frequency distribution of
responses; (b) rater consistency and reliability estimates; and (c) correlations with multiplechoice items. For multiple-choice items, analyses may include difficulty and discrimination
indices. It may also include an analysis of the distractors, student response patterns, and
indications of speediness.
Items that appear statistically flawed should be carefully reviewed for possible contentrelated problems and for structural problems with the item (e.g., inadvertent cues to the key or
distractors that are too close to the key).
Test Assembly. This stage of the development activity includes the process of selecting
and organizing a particular set of items that will constitute a given form of a test. Test form
assembly requires expert-level knowledge and skills in test construction, including an
understanding of the relationships between the content and statistical characteristics of the
items in a test and the test‘s measurement properties. Though test assembly is guided by test
specifications, it also requires the well-reasoned decisions of a test developer who understands
the relevant measurement principles and the judgments of content experts.
Test Specifications, Item Development, Forms Assembly,
and Item-Level Statistics for the AA-MAS
The primary purpose of this section is to describe approaches and strategies and identify
various considerations that test developers of AA-MAS should take into account as they develop
Considerations for an AA-MAS
Page 205
or modify general state assessments to create new alternate assessments. The intention is to
focus on various steps of the development process, as outlined in the section on best practice,
particularly those that are most relevant or available for modifications.
Several major assumptions guide the discussion in this section:
1. Modified achievement standards do not imply that the content standards are being
modified. Rather, AA-MASs must adhere to the state content standards and must cover
the same breadth and depth as the general assessment. The AA-MAS must be aligned
to the content standards with respect to the content and process specifications but may
be less difficult.
2. Given the substantial investment that states have made in the design and development
of their testing programs, states may elect to modify existing assessments as a preferred
approach to developing an AA-MAS.
3. The AA-MAS must satisfy reasonable technical requirements in terms of validity and
reliability (Sato, Rabinowitz, Worth, Gallagher, Lagunoff & Crane, 2007). In order to
maximize the validity of the AA-MAS, test developers must follow the same rigorous and
iterative approach that has been established as best practice in test development.
Considerations for an AA-MAS
Page 206
Figure 6-1. Schematic Diagram of Interplay among Test Design Components
Content
Standards
Test
Specifications
Achievement
Standards
Item Development, Test Assembly and Delivery
Cut score on the
general state
assessment
Figure 6-1 implies that there is a parallel relationship between content and achievement
standards and test specifications and cut-scores. The content standards define what students
need to know. The achievement standards define how well the students know the content
standards and determine which students are proficient and which are not.
The test specifications are the translation of the content standards into assessment
language (what will be on the test that the students need to know). The cut-score on the
assessment determines the minimum score that indicates proficiency.
If test developers are not allowed to modify the content standards, then the test
development effort for the AA-MAS should focus on test specifications (and item and test
development) for allowable modifications. Depending upon the extent of these modifications, the
AA-MAS may require either a new standards-setting process to locate the cut score
representing the modified achievement standards, or a validation study to examine the fidelity of
the existing cut score given the extent of the modifications to test specifications and items. If
Considerations for an AA-MAS
Page 207
the achievement descriptors for the modified achievement standards are rewritten to reflect a
changed definition of proficiency, a new cut score study would also be necessary.
Special Considerations for Test Specifications
As with any assessment, the establishment of the test domain for the AA-MAS is the first
consideration. State content standards and achievement standards define the test domain in the
assessments being used to meet the requirements of NCLB and in the AA-MAS. The content
standards are not to be modified. However, a review of the state content standards would be an
appropriate first step. Grade-level content standards could be reviewed to ensure that students
have an opportunity to achieve grade-level content. Students must have access to and must
have received instruction in the grade-level content.
As discussed earlier, test specifications should include articulation of the content areas,
process skills, and the balance between the two. They also include decisions about the item
format, test length and technical characteristics of the assessment. Careful consideration of
modifications introduced in the test specifications phase of development may produce
assessments that more closely model the range of grade-level content appropriate for students
eligible for this assessment. This may provide students with a better opportunity to be assessed
on the same grade-level content standards as all other students, but with modifications to the
expectations for the mastery of the content. Access to these modified assessments based on
these changes to test specifications will ideally provide a better estimate of the student‘s
achievement. However, making these modifications is not without compromise. Changes to the
test specifications can result in modified assessments that are less comparable, less reliable or
even less valid than the original assessment. Table 6-1 provides an overview of modifications
that would be viable based on changes to test specifications. The advantages of making these
changes and the potential limitations of such changes are also provided.
Considerations for an AA-MAS
Page 208
Table 6-1. Proposed Modifications to Test Specifications
Test
Specification
Content
Process
(or Cognitive
Level)
Content by
Process
Weighting
Item Formats
Proposed Modification
Reduce the number of items per content
standard
Reduce the number of items per process
standard
Advantages
Possible Limitations
Maintains alignment with content standards
Maintains alignment with content standards
Comparability of test specifications of
the AA-MAS to the general assessment
May alter proportional representation of
tested construct(s)
Ensures accessibility of test materials
Reallocate the process skills to reflect a
more appropriate match to student
abilities in terms of breadth, depth and
complexity
Adjust the relative weights of the content
and process dimensions
Diversify the item formats to maximize
inclusion of those that are preferable for
content and process coverage
Reduces difficulty
Improves match of content strand to
appropriate level of cognitive processes
The interaction between process and
content is often difficult to quantify.
Allows for partial credit to be given for shortanswer, extended responses, and other
types of open-ended items
Comparability of the test specifications
of the AA-MAS to the general
assessment
Allows for fewer items to cover more content
and process standards if the appropriate
items are written and appropriate
adjustments made to scoring rubrics
Comparability of the scores from the
AA-MAS to the general assessment
New scales may need to be established
Additional open-ended items would
require additional resources for scoring,
additional time for reporting
Exposure of items and need for
additional forms of the assessment
Designing scoring rubric to be aligned
with content standards while improving
accessibility for students of interest
Test Length
Reduce the reading load of the assessment but maintain the number of items
Reduce the overall number of items in the
test
Considerations for an AA-MAS
Allows students more time per item
Field testing of open-ended items on
appropriate student population
Reduces reliability
Decreases speediness impact of the
assessment
Reduces precision of the cut-score
decisions
Reduces impact of student fatigue
Page 209
Technical
Characteristics
Reduce the overall difficulty of the
assessment by eliminating the most
difficult items proportional to content
standards
Increase proportion of students with IEPs
exceeding the cut score
Increase information about total score peritem included
Replace the most difficult items with
simpler items covering the same content
standards
Reduces precision of the cut-score
decisions
May alter construct representation
Increase costs associated with item
development
Increase the overall discrimination of the
assessment by adding appropriate items
Reduce the overall difficulty of the
assessment by eliminating higher order
process items
Considerations for an AA-MAS
Reduces reliability
Page 210
Whenever modifications such as these are considered, experts who possess knowledge
of the student population, can access relevant information, and are familiar with the state
content standards are a critical part of the process. These experts need to address the complex
interactions of the various approaches. For example, reducing the number of higher-order
process items may benefit this particular student population, but may not be consistent with the
state content standards.
Any modifications to test specifications must be consistent with the guiding assumptions
cited previously in this chapter. That is, AA-MASs must adhere to the state content standards
and must cover the same breadth and depth as the general assessment.
Using information from the New York State Testing Program 2007 Technical Report for
Mathematics, Grades 3-8 (NYSED, 2007, December), the example shown in Table 6-2
illustrates a hypothetical modification of content specifications and the relative weights of item
format for the Grade 5 Mathematics. The entries in Table 6-2 in regular type are the number of
items for each content standard for the general assessment in 2007, whereas the entries in
italics are proposed modifications for a potential AA-MAS.
Table 6-2 – Implications Based on Modifications to Test Specifications
Content Standard
Number Sense and
Operations
Algebra
Geometry
Measurement
Statistics and
Probability
Totals
Multiple-Choice
14 (9)
Constructed-Response
1 (1)
Points Allocated
16 (11)
3 (2)
4 (2)
2 (1)
3 (2)
1 (1)
3 (2)
2 (1)
1 (1)
6 (4)
12 (8)
6 (4)
6 (4)
26 (16)
8 (6)
46 (31)
Note: Entries in regular type are for the general assessment, while the entries in italics and parentheses are for a
potential AA-MAS.
The modification to the test specifications preserves (as closely as possible given fixed
counts of items and points in the general assessment) the proportion of items aligned to each
content standard as well as the proportion of items in each format based on a one-third
reduction in the total number of items. It should be emphasized that the one-third reduction in
Considerations for an AA-MAS
Page 211
this example does not represent an arbitrary selection of items to remove from a general
assessment, such as the most difficult items, but instead items whose removal does not alter
the content balance nor detract from the technical quality of the resulting AA-MAS. The guiding
principle is to remain true to the overall specifications while reducing the length (i.e., number of
items) of the test.
An additional dimension that may be considered at this stage is process or cognitive
level of the items in the AA-MAS. NCLB guidelines require evaluation of cognitive level, and test
specifications in many states reflect this aspect of items as well as content strand. Even though
cognitive level may not be specified on an item-by-item basis during test assembly, a
distribution of items is often identified for three or more levels of a cognitive hierarchy, and
attention to these features of items is important in proposed modifications for AA-MAS. Because
constructed-response items (and the rubric specifications for high scores on those items)
typically define higher levels of a cognitive hierarchy, their proportional representation in the AAMAS is critical.
Proportional representation of content specifications, cognitive levels, and item formats
is intended to preserve certain aspects of test validity to yield comparability. The reduction in
total-score points and number of items can have a predictable effect on reliability. In the
example, the NYSED math assessment had reported reliability coefficient of .93. The reliability
estimate of the modified assessment depicted in the table is .87.
Special Considerations for Item Development
The item development process involves many varied, yet related, considerations. In this
context, item development refers to the three major processes of item writing, item reviewing,
and field testing. Although many of the processes are similar for both selected-response and
constructed-response items, there are also characteristics of these two item types that would
Considerations for an AA-MAS
Page 212
suggest they should be discussed separately. Tables 6-3 and 6-4 present possible
modifications, their advantages and limitations, for these two categories of items.
Consistent with needs for the design of the test specifications, item development for the
AA-MAS will involve the identification of experts who are familiar with the student population and
who are expert in providing appropriate and sufficient access to the general curriculum to
prepare students to complete this assessment. Identifying experts to assist in the drafting or
revising of items and reviewing these items for a variety of issues related to the student
population needs will be critical. It will be critical that the role of these experts remain very
central throughout the entire development process. Guidelines for use by the item writers and
reviewers should include strategies for adapting items for students eligible for the AA-MAS.
Frequent iterations of items should be expected in this process. All newly created items will
need to be generated, reviewed, and revised throughout the development process by experts.
All modified items should be subjected to the same rigorous review and refinement process.
Reviews should take place as early in the process as possible, maximizing the benefits of the
reviews prior to field testing.
Research on Item Modifications. Since the first draft regulations for the AA-MAS were
issued by USED, ideas for item and test modification have appeared in white papers and plans
submitted by states for peer review. Some of these ideas are included in this discussion. Much
like the early years of work on testing accommodations for students with special needs,
students with disabilities, and English language learners, ideas and innovations grow out of
administrative imperatives and policy considerations for inclusion of all students in assessment
and accountability programs. Empirical studies of the effects of accommodations on
comparability of test score interpretations tend to lag behind the innovations themselves.
Although some research on adapting learning and assessment tasks for students with
mild disabilities has been completed (AIR, 2000; Bergeson, Wise, Gill & Barlett, 2001) that
would provide support for these suggestions, it would be misleading to assert that the
Considerations for an AA-MAS
Page 213
suggestions offered here for modifications of existing items for accessibility in the AA-MAS
context have a strong research base or have been shown empirically to justify the spirit and
intentions of the law with regard to comparability of test-based inferences or fidelity to the
accountability provisions of NCLB. Rather they should be understood as rational approaches to
the challenges of the AA-MAS that should be validated as any other aspect of an accountability
system should be validated.
Empirical studies need to be conducted in order for test developers to provide the
information necessary for appropriate interpretation. For example, studies that demonstrate a
consistency between scores on a child‘s AA-MAS with other types of information about the child
(IEP team evaluations, classroom performance) should be conducted. Studies that examine the
relationship between the AA-MAS and other measures of the same constructs that are not
necessarily used for state accountability purposes (performance on formative assessments,
performance on diagnostic assessments) should also be planned. Research should also be
planned to examine the internal structure for the AA-MAS as compared to the general
assessment. Results from factor analysis for the AA-MAS could be compared to factor analysis
results on the general assessment.
Considerations for an AA-MAS
Page 214
Table 6-3. Proposed Modifications to Item Development Process for Multiple-Choice Items
Process
Item Writing
Proposed Modification
Prepare reading passages; and related items
with as much scaffolding as possible
Control stimulus complexity to allow for the
minimum level of complexity while remaining
aligned with the content standards
Write items with effective distractors for the
AA-MAS population
Advantages
Increase accessibility of items
Possible Limitations
Increased development time
Maximize students‘ ability to
demonstrate what they know
Increased development budget
Comparability issues
Reduce effects due to distractibility and
fatigue
Alignment issues
Remove construct irrelevant variance
due to visual acuity, linguistic complexity
Increase chance of correct
response by guessing
Ensure distractors are contributing to
student information
Increased development time
Use figures, pictures and graphs to aid
students in understanding the items
Item Reviews
Field Testing
Remove irrelevant language from items that
may distract students
Review items for the possible revision of
distractors that are attracting a very limited
number of students
Increased development budget
Review items for the possible revision of
distractors that are misleading to students
Ensure relevance to classroom
experiences and consistency with
everyday learning supports
Review items for irrelevant language
Graphics aid understanding
Review figures, pictures and graphs for
appropriate contributions and relevance
Conduct cognitive interviews, cognitive labs
and think-alouds
Provides preliminary analysis of
processing levels
Field test items on student populations that
are representative of students eligible for the
AA-MAS to investigate the appropriateness
and feasibility of the modifications.
Field test all new or revised items on the
appropriate sample of students
Provide relevant statistics for use in
forms assembly
Provide statistics on both the AA-MAS
and general assessment students to
identify different response patterns
Field test parallel variations of items (i.e.,
Considerations for an AA-MAS
Page 215
Comparability issues
Increase development budget
Limited access to appropriate
students
with various levels of scaffolding, with
various levels of instruction, with various
levels of language complexity) to identify
those working most appropriately
Beta test items on small samples of students.
Conduct think-alouds with students to identify
characteristics that are benefiting students
and could be duplicated in future item.
Generate DIF statistics from the field test to
help make item selection
Generate item analysis statistics from the
field test to help evaluate an item‘s ability to
discriminate for the students of interest
Considerations for an AA-MAS
Page 216
Table 6-4. Proposed Modifications to Item Development Process for Constructed-Response Items
Process
Item Writing
Proposed Modification
Develop items that lend themselves to
scaffolding, allowing students the
opportunity to work through controlled
sections of the items
Advantages
Increase accessibility of items
Maximize student‘s opportunity to fully
demonstrate what they know
Possible Limitations
Generalizability of results may be
reduced
Reliability of the assessment may be
reduced
Use figures, pictures and graphs to aid
students in understanding the items
Item Reviews
Field Testing
Articulate the scoring criteria when the
item is unusually drafted
Review scoring criteria for content,
fairness and universal design
considerations
Conduct cognitive interviews, cognitive
labs and think-alouds
Field test items on student populations
that are representative of students eligible
for the AA-MAS to investigate the
appropriateness and feasibility of the
modifications
Scoring
Beta test items on small samples of
students. Conduct think-alouds with
students to identify characteristics that are
benefiting students and could be
duplicated in future items.
Allow for partial credit of responses
(based on scaffold items structure)
Evaluate item responses separately for
both content and process skills
Apply different types or scoring rubrics to
the same item responses
Increase appropriate difficulty of items
Provide preliminary analysis of
processing levels
Solicitation of responses from
representative students to be used
establish the scoring criteria.
Exposure of items
Security of items
Provide relevant statistics for use in
forms assembly
Generalizability of results may be
reduced
Maximize the unique information
available from constructed-response
items
Reliability of the assessment may be
reduced
Generate distributional statistics for all
constructed – response items for the AAMAS and general assessment students
Considerations for an AA-MAS
Costs of studies
Page 217
Comparability to general assessment
may be weaker
Modifications in Item Development. Consistent with the previous section on Research on
Item Modifications, the effect of item modifications should be empirically studied. The methods
used to modify the items should be thoroughly described as part of the validation process.
Empirical and logical evidence should be also be provided. Table 6-5 illustrates the application
of item modifications to several sample items. Modifications such as those recommended for
items 2 and 3 employ the principles of universal design. Such principles are most appropriately
included in the standard development procedures for all new item development. When the
selected approach is the modification of an existing assessment, universal design principles are
critical to inclusion in the modification process.
Considerations for an AA-MAS
Page 218
Table 6-5. Examples of Modifications in Development
Original Item
Description of Modification
Revised Item
Anna baked 6 of 10 cupcakes for her classmates.
Which number sentence describes how many more
cupcakes Anna has to bake?
Anna needs to bake 10 cupcakes. She has baked
6. Which number sentence describes how
many more cupcakes Anna needs to bake?
A 10 + 6 = □
B 10 – 6 = □
C 10 × 6 = □
D 10 ’ 6 = □
A 10 + 6 = □
Simpler sentence structure
Use of additional space between distractors
Use of bold text to highlight question
B 10 – 6 = □
C 10 × 6 = □
D 10 ÷ 6 = □
Recycling brought to Green River Recycling Plant last
month:
Week 1:
1,178 pounds
Week 2:
1,065 pounds
Week 3:
1,879 pounds
Week 4:
1,997 pounds
The closest estimate of the total recycling taken to
Green River Recycling Plant was
.
A 4,000 pounds
B 6,000 pounds
C 8,000 pounds
D 10,000 pounds
Considerations for an AA-MAS
Green River Recycling
Week
1:
2:
3:
4:
Table with title, clear headings and reduced
verbiage
Pounds
1,178
1,065
1,879
1,997
Alignment of numerals
Less text in title
What is the best estimate of the total recycling
taken to Green River Recycling last month?
A 4,000 pounds
B 6,000 pounds
C 8,000 pounds
D 10,000 pounds
Page 219
Question format changed from incomplete
sentence
Sarah and her family went to the grocery store. At the
store Sarah and her brother Kyle went up and down
the aisles looking for their favorite snacks. They each
bought 2 snacks. One snack cost $2. How much did
the children pay for the snacks altogether?
Sarah and her brother bought 2 snacks each at
the grocery store. One snack cost $2. How much
did the children pay for the snacks altogether?
A $4
B $8
C $12
D $24
B $8
Reduce demands on working memory
Use of additional space between distractors
A $4
Alignment of numerals
C $12
D $24
How do the authors portray Luis in the second
paragraph?
How is Luis described in paragraph 2?
Simpler sentence structure
A A student with many interests.
A As an eager student with many interests
B As a popular boy with many friends
C As someone who preferred performing to
schoolwork
D As someone who had trouble deciding what he
wanted to do
According to the passage, what is a dory?
A
B
C
D
A wild bird
A large pail
A small boat
A body of water
Reduce irrelevant detail
B A boy with many friends.
Use of additional space between distractors
C Someone who didn‘t like schoolwork.
D Someone who couldn‘t decide what he liked.
In the line marked with , what is a dory?
A A wild bird
B A large pail
Example of the use of a support
C A small boat
Potential for scaffolding
D A body of water
Considerations for an AA-MAS
Visual aid introduced to help focus student
on appropriate place in the reading
passage
Page 220
Special Considerations for Forms Assembly
Table 6-6 presents similar modifications for the forms assembly and administration of the
AA-MAS. As with the previous sections, advantages and limitations exist for every type of
modification. After items have been developed, field tested, revised, and deemed eligible for
inclusion on an operational form, the test developer will select the operational items from the
pool of field-tested items, using all available data (item-level statistics for difficulty,
discrimination, DIFF, IRT parameter estimate). In general, test developers will be more
successful in assembling forms if an item pool exists that allows for some degrees of freedom in
the selection of items for inclusion on the operational form. Although building such a pool would
require additional time and resources from the state, the benefit of such efforts would be
realized in the assembly process.
Test developers should complete a match-to-specifications report based on the final
assembled form. This process ensures the alignment of the modified assessment to the content
standards, the test design and specifications and the guidelines for item selection. This process
also provides documentation of the overall characteristics of the form and how these
characteristics compare to the target test specifications. Comparisons of distributions of item
difficulties and discriminations from the field test statistics to the target technical distributions
should be made. Estimates of reliability for the assembled form and estimates of the standard
error of measurement should also be included in the match-to-specifications report. This is
critical information to the review and approval of the assembled form. This information provides
one last opportunity for the test developer to make changes to the composition of the
assessment before an operational administration.
Considerations for an AA-MAS
Page 221
Table 6-6. Proposed Modifications to Test Assembly Process
Process
Selecting items for
inclusion on the
operational form
Proposed Modification
Assemble forms from the least difficult to the
most difficult item
Advantages
Increased accessibility for students
Possible Limitations
Comparability to general assessment
Eliminate distractions for students
Comparability to general assessment
Reduces fatigue
Reduces examinee error
May differ from general assessment
administration format
Assemble items to reduce the number of
items within any one test section
Assemble items to minimize changing from
one content standard to another (for
example, within a math test, group the
geometry items together, then group the
measurement items together)
Test layout
Maximize white space in the test booklet
Follow principles of universal design
Test administration
Limit the number of items per page or screen
presentation
Minimize the number of items presented in
any separately timed section of the
assessment (for example, if a 44-item math
test could be divided into two 22-item
sections, assemble and administer in the
shorter blocks)
Minimize the transferring of information from
a test booklet to an answer document by
offering online delivery, consumable test
booklets or other mechanisms for capturing
the student responses
Considerations for an AA-MAS
Page 222
The greater the change in development, selection, presentation and administration of
items change, the less likely it will be that states can ―link‖ the performance on the AA-MAS to
performance on the general assessment. However, one strategy, the reduction of the number of
items proportional to the test specifications for the general assessment offers the possibility of
relating or linking the reduced version of the assessment to the full length of the assessment. In
such instances, the modified assessment may be structured to maintain the intuitive
understanding of the standard score scale used for the general assessment. This approach may
offer some utility with respect to the interpretability of the results. Table 6-7 offers a quick
summary of the impact on comparability and the scale for three tiers of change.
Table 6-7. Impact of AA-MAS Strategy on Comparability and Scale
Possible Strategy
Develop new assessment
Modify items (i.e., shift in item formats,
number of distractors, scaffolding)
Reduce number of items (proportional to
content standards)
Comparability to
General Assessment
None
Limited
Linked
Scale Considerations
New standard setting
New scale necessary
New standard setting
New scale possible
Retention of scale
Validation of cut scores with
standard setting study
Special Considerations for Evaluating Statistical Characteristics of AA-MAS Items
A standard activity in any test development context is the statistical evaluation of items
for an assessment. In the AA-MAS context this might happen in various places in the item and
test development workflow as different types of item statistics become available. Preliminary
data might exist, for example, from small samples in which item modifications are pilot-tested for
accessibility and feasibility of administration. Field test data on larger samples may be reviewed
after formal content and sensitivity reviews take place and prior to test form assembly. In test
development processes using item-response theory, item- and person-fit statistics on larger
samples may be needed. A significant challenge in the AA-MAS context, however, is the fact
Considerations for an AA-MAS
Page 223
that item statistics may not be readily available at ideal times from ideal samples of students
from the population.
A reasonable approach to developing an empirical basis for item selection and
modification is to examine conventional item analysis statistics for items in the general
assessment. Because states have been administering their general assessment to all students
except the one-percent with the most severe disabilities since 2006, item-level data for students
who might be deemed eligible for the AA-MAS assessment presumably exist. Statistical
characteristics of items in this target population may provide some insights for item selection
and modification.
Item Analyses for Contrasting Groups. Conventional item analyses for dichotomous,
multiple-choice items produce observed percent correct or p-values to measure item difficulty,
correlations between items, and total scores to measure item discrimination, as well as more
detailed indicators of item functioning such as the percent of examinees choosing each multiplechoice option, and correlations between option choice and total score as measures of distractor
discrimination. Also informative for the latter concept is the percent of high- and low-scoring
examinees choosing each distractor. In addition, many state assessments are likely to have
similar item statistics based on item-response theory. Test developers use indicators of item
difficulty to assemble test forms appropriately matched to the achievement level of the
examinee population and indicators of discrimination to ensure some degree of homogeneity in
the selected items. Both item characteristics influence the reliability and internal validity of the
assembled test form.
Of interest to the present discussion is the extent to which item statistics might provide
insight into the performance of items in the target population for the AA-MAS. The specification
of that population means that p-values are likely to be, by definition, smaller in the AA-MAS
population than in the full examinee population as that population consists of students not likely
to be proficient on the general assessment. One might also expect item-total correlations to be
Considerations for an AA-MAS
Page 224
smaller due to the restricted range of total scores in the AA-MAS group. A full array of item
analysis statistics can provide test developers with some guidance on the relative performance
of items in the two-percent population. For example, items with marked differences in difficulty
and markedly low discrimination for the AA-MAS group could be argued to contribute to low
scores without contributing to observed score variance for the students of interest. Such item
statistics combined with poorly performing distractors could support the elimination of these
types of items from the two-percent assessment.
Table 6-8 provides a concrete example of distributions of p-values and item-total
correlations on grade 5 state math and reading assessments in an examinee group identified as
potentially eligible for the AA-MAS in a given state and the general student population. The AAMAS group consisted of students who had IEPs and who were deemed not proficient in two
consecutive years of the general assessment. The results in the table are based on the second
of those years.
Table 6-8. Mean (SD) Difficulty and Discrimination for Items in an AA-MAS and General
Assessment
Reading
Math
Population
Difficulty
Discrimination
Difficulty
Discrimination
General
.67 (.14)
.57 (.12)
.68 (.13)
.57 (.13)
AA-MAS
.37 (.15)
.27 (.09)
.35 (.10)
.30 (.10)
As can be seen from the entries in Table 6-8, the mean difference between item difficulty
in the two student populations is substantial and translates into effect sizes of 2.5 and 2.1 for
math and reading, respectively. Standardized mean differences of this magnitude are extremely
rare in typical comparisons of subgroups in educational testing and suggest that whatever
modifications of the general assessment are introduced for the AA-MAS population, their impact
on performance must indeed be great if the AA-MAS form is expected to alter AYP results.
Proposed modifications of items during AA-MAS test development must attend to characteristics
of items in such a way that the modification effort will have a measurable impact on test results
Considerations for an AA-MAS
Page 225
in the context of accountability. Specific modifications based on item statistics will be considered
below.
The distributions of p-values and item-total correlations are shown in the stem-and-leaf
plots in Figure 6-2. Each statistic is expressed on a scale from 0 to 1. The plots show the tenths
digit in bold and the hundredths digit in regular type. Hundredths digits to the right in italics are
for the general student population, and those to the left are for students eligible for the AA-MAS.
As can be seen from the figure, the distributions of p-values and item-total correlations are
generally symmetrical in both the general population and the group identified for the AA-MAS.
The p-values for math in the AA-MAS sample are somewhat positively skewed. The distinctive
feature of these distributions is the small degree of overlap. This is of particular interest in the
case of the item discrimination indices. Ideally test developers would like the distribution of
discrimination indices to be similar. However, range restriction on total score is likely to
systematically lower item-total correlations in the AA-MAS population. The dramatic separation
of the distributions of these correlations suggests there may be additional reasons for low
discrimination. Understanding why would be an important part of AA-MAS test development if
modification of items from the general assessment is the selected approach. The item-total
correlations suggest that even within the AA-MAS population, items in these sets, irrespective of
content, possess idiosyncratic characteristics that reduce their overall correlation with total
scores. If these characteristics can be isolated by content or statistical analyses of item keys
and distractors, for example, then perhaps target modifications at the item level could at once
make items less difficult and increase their internal consistency and correlations with total test
scores. Some illustrations of these ideas are presented below.
Considerations for an AA-MAS
Page 226
Figure 6-2. Stem-and-Leaf Plots of Item Difficulty (proportion correct) and Discrimination (itemtotal correlation)
Item Difficulty
Mathematics
.9 5
.9 022
.8 699
.8 00012334
86 .7 689
2 .7 00001233344
7 .6 56667789
32 .6 0011223
8765 .5 5567788
11 .5 0023
977655 .4 89999
43321 .4 2234
998888876 .3 6
33222221100 .3
97666655 .2
4433211100 .2
9999 .1
0 .1
.0
.0
Reading Comprehension
.9
.9
.8 5568
.8 1233444
.7 6778
.7 122
.6 567779
.6 0122234
65 .5 579
42 .5 1223444
97665 .4 5
4310 .4
9876655 .3 9
443222211000 .3
9865 .2
332221 .2
.1
4 .1
.0
.0
Item Discrimination
Mathematics
Reading Comprehension
.8
.8 0
.7 55
.7 9
.7 0001111334
.7 02234
.6 5556799
.6 57889
.6 01112333444
.6 01112334
5 .5 5566678889
.5 55669
4 .5 001111122233444
.5 123455
77665 .4 6677
6 .4 6666779
00112444 .4
433 .4 0124
9876 .3 7999
88665 .3 689
44433221110 .3 03
43332211 .3
99998777776555 .2
998655 .2
444433322 .2
44432222110 .2
9998888885 .1
88775 .1
42 .1
442 .1
9 .0
6 .0
.0 0
.0
General population (italics) and AA-MAS examinee populations
Considerations for an AA-MAS
Page 227
Table 6-9 gives an example of a distractor analysis from a statewide mathematics
assessment given to nearly 40,000 students in grade 5. The item measures a student‘s ability to
compute total length of six objects and to convert inches to feet. Data marked AA-MAS are
from a group identified previously as eligible for a modified assessment. This item has
reasonable statistical properties in the general population (difficulty = .42, discrimination = .53).
In that population, 35% are drawn to the combination of numbers, 5 and 4, that reflect correct
calculation of total length but no conversion from inches to feet (54 inches, thus 5 feet 4 inches).
A simple distractor analysis shows the nature of the error most common in the general student
population.
Table 6-9. Illustration of Distractor Analysis
A brick is 9 inches long. If 6 bricks are lined up, one after the other, in
a row, how long is the row of bricks?
Options
General
AA-MAS
Performance
Performance
A
1 foot 3 inches
9%
17%
B*
4 feet 6 inches
42%
19%
C
5 feet 4 inches
35%
25%
D
6 feet 9 inches
14%
39%
Item Statistics:
37,223
2,432
Sample Size
.42
.19
Difficulty
.53
.12
Discrimination
The students in the AA-MAS eligible group were drawn to option C as well. Those
students demonstrated a similar misunderstanding. However, option D (6 feet 9 inches) was
the most frequent (39%) response in the AA-MAS group. This distractor simply repeats the
specific numbers used in the item stem and indicates no calculation and no conversion of units.
The summary statistics (difficulty = .19, discrimination = .12) indicate this item to be providing
very little information about total test scores for the AA-MAS population. Based on such
information, the test developer could choose to replace distractor D with a different option,
eliminating the repetition of specific numbers used in the stem, or reduce the number of
distractors by eliminating D.
Considerations for an AA-MAS
Page 228
This concrete example highlights the complexities involved in modification strategies
such as elimination of distractors. Eliminating the most popular distractor in the total group
would do little to change the behaviors of this item in the AA-MAS group. Moreover, such a
strategy would eliminate the distractor that carries a meaningful message in an error analysis.
Distractor elimination is clearly going to alter the construct interpretation of item performance
and perhaps do so without any gain in relative item difficulty and impact. This concrete example
is designed to illustrate that eliminating distractors can have untold effects on the meaning of
resulting test scores. Cases such as this might be better addressed by the elimination of entire
items that can be arguably shown to contribute little to total scores in the AA-MAS population.
Differential Item Functioning. An apparently straightforward analysis for identification of
items for modification in the AA-MAS context would be Differential Item Functioning (DIF). DIF
methods are designed to detect items with different item characteristic curves in two
populations, in other words systematic differences in the item performance of examinees in
groups matched on general achievement in the domain. As discussed by Abedi (Chapter 8, this
volume), DIF methods are routinely used as part of the test development process to screen
items for psychometric appropriateness with respect to background characteristics of
examinees such as gender, ethnicity, SES, native language, and disability status. DIF methods
have the potential to provide insight into facets of item design that may unknowingly create a
relative advantage or disadvantage to examinees that is unrelated to the construct measured by
the test.
In the AA-MAS, DIF methods might be thought to offer insight into differential
performance at the item level. As noted by Abedi, however, when DIF methods are used in
assessment context with multiple focal groups of interest (e.g. students with disabilities,
linguistic minorities, ethnic minorities, etc.) it can become difficult to find consistency in the
flagging of items. Moreover, the statistical limitations of DIF methods have necessitated the
development of judgmental criteria for evaluating the magnitude of DIF (e.g. its expected
Considerations for an AA-MAS
Page 229
influence on total scores, Zieky, 1993) to supplement statistical criteria used in testing null
hypotheses of no DIF. In particular, DIF methods tend to perform poorly when groups differ
markedly in overall test performance and when the variable used for matching (typically total
test score) does not allow adequate matching throughout its range (add DIF references). Given
the effect sizes presented previously as well as the distributions of p-values in a prospective AAMAS population relative to the general population, DIF methods are likely to prove difficult to
apply in the test development process for the AA-MAS. Large mean differences between groups
and sparseness of scores in the upper ranges of total score distributions are likely to produce
spurious DIF in the AA-MAS context (Holland and Thayer, 1988; Camilli, 2006).
Validating the AA-MAS
Validity remains the most fundamental consideration in developing and interpreting any
assessment. Although this chapter has been devoted to the development of a sound AA-MAS, it
is the use and interpretation of these scores that must be validated. There are numerous
sources of evidence that might be used to evaluate a proposed interpretation. One critical type
of validity evidence is based on the test content. As defined in the Standards (AERA, APA, &
NCME, 1999), content refers to the themes, wording and format of the items, tasks or questions
on a test, as well as the guidelines for procedures regarding development, administration, and
scoring. This chapter has attempted to discuss issues about differences in meaning or
interpretations of test scores for an AA-MAS when compared to a general assessment. Of
particular concern was the extent to which construct-irrelevant components could be eliminated
to avoid disadvantaging students eligible for an AA-MAS, to create an assessment that provides
students an opportunity to demonstrate what they know and can do. Consistent with Marion
(Chapter 9, this volume), content-related evidence requires evaluating the interaction of both
content and process required of the test items and documenting that the interaction is what is
expected.
Considerations for an AA-MAS
Page 230
The responsibility for validating the AA-MAS is shared between test developers, users
and education policymakers. An important aspect of test validation in this regard is the
documentation of the test and item development processes used for the AA-MAS, the specific
steps followed in the test development workflow, the types of item modifications chosen, the
expert analysis of the cognitive demands of the items, the impact measured through thinkalouds, cognitive interviews, and field testing, studies examining differences in performance on
items, and the changes to test specifications and distributions of items across formats, content
strands and cognitive levels. Education policymakers are likely to weigh in on general
parameters of AA-MAS development such as the representation of subject matter included, the
process of setting standards, and the budget allocated for test development, delivery, and
reporting. More specific aspects of validation are a joint responsibility of test developers and
users. In statewide assessment, SEAs are both developers and users, but SEAs typically work
with one or more contractors who carry out the activities associated with item and test
development. Validation of the AA-MAS may be a responsibility of an SEA but an action step
that is incorporated into the RFP process and the deliverables specified during contract
negotiations. If the goal is to develop the foundation for a validity argument in support of
proficiency-related inferences based on the AA-MAS (Kane, 2006), then the outline of the
argument needs to be formulated in the joint work of an SEA and its contractors.
Considerations for an AA-MAS
Page 231
References
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education (1999). Standards for Educational and
Psychological Testing. Washington, DC: American Psychological Association.
American Institute of Research (AIR) (2000). Effects of item scaffolding on student responses: A
cognitive laboratory study. Washington, DC: AIR for Institutes for Research for the
National Assessment Governing Board in support of contract # RJ97153001.
Assessment and Accountability Comprehensive Center (2007, November). Assessments based
on modified achievement standards: Critical considerations and implications for
implementation. San Francisco: WestEd.
Bergeson, T., Wise, B. J., Gill, D. H., & Bartlett, K. M. (2001). Adaptations are essential: A
resource guide for adapting learning and assessment tasks for students with mild
disabilities. Olympia, WA: Special Education Section of the Office of the Superintendent
of Public Instruction. Available at:
http://www.k8accesscenter.org/accessinaction/documents/EARLYwritingADAPTATIONS
.pdf.
Bhola, D. S., Impara, J. D., & Buckendahl, W. (2003). Aligning tests with states‘ content
standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3),
21-29.
Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp.
221-256). Westport, CT: American Council on Education/Praeger.
Code of Fair Testing Practices in Education (2004). Washington, DC: Joint Committee on
Testing Practices.
Filbin, J. (2008) Lessons from the initial peer review of alternate assessments based on
modified achievement standards. Washington, DC: U.S. Department of Education, Office
of Elementary and Secondary Education.
Holland,P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel
procedure. In H. Wainer and H. Brown (Eds.), Test validity (pp. 129-145). Hillsdale, NJ:
Erlbaum.
Johnstone, C. J., Altman, J., & Thurlow, M. (2006). A state guide to the development of
universally designed assessments. Minneapolis, MN: University of Minnesota: National
Center on Educational Outcomes.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp.
17-64). Westport, CT: American Council on Education/Praeger.
Lindquist, E. F. (1951). Preliminary considerations in objective test construction. In E. F.
Lindquist (Ed.), Educational Measurement (pp. 119-158). Washington, DC: American
Council on Education.
Considerations for an AA-MAS
Page 232
Linn, R. L, Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment:
Expectations and validation criteria. Education Researcher, 20(8), 15-21.
Marion, S. (2007). A technical design and documentation workbook for assessments based on
modified achievement standards working draft. Minneapolis, MN: University of
Minnesota, National Center on Educational Outcomes.
Perie, M. (2008). A guide to understanding and developing performance-level descriptors.
Educational Measurement: Issues and Practice, 27(4), 15-29.
New York State Department of Education (2007, December). New York State Testing Program
2007: Mathematics, Grades 3-8, Technical Report. Albany, NY: Author.
Rothman, R., Slattery, J. B., Vranek, J. L., & Resnick, L. B. (2002). Benchmarking and
alignment of standards and testing (CSE Technical Report 566). Los Angeles: UCLA
Center for Research on Evaluation, Standards and Student Testing.
Sato, E., Rabinowitz, S., Worth, P., Gallagher, C., Lagunoff, R., & Crane, E. (2007, September).
Evaluation of the technical evidence of assessments for special student populations
(Assessment and Accountability Comprehensive Center Report). San Francisco:
WestEd.
Schmeiser, C.B. & Welch, C.J. (2006). Test development. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 307-353). Westport, CT: American Council on
Education/Praeger.
Thompson, S. J., Johnstone, C. J., & Thurlow, M. (2002). Universal design applied to largescale assessments (NCEO Synthesis Report 44). Minneapolis, MN: University of
Minnesota, National Center on Educational Outcomes.
U.S. Department of Education (2007a, April 9) Final Rule 34 CFR Parts 200 and 300: Title I —
Improving the Academic Achievement of the Disadvantaged: Individuals with Disabilities
Education Act (IDEA). Federal Register. 72(67), Washington DC: Author. Available at:
http://www.ed.gov/admins/lead/account/saa.html#regulations.
U.S. Department of Education (2007b, July 20), Modified Academic Achievement Standards:
Non-regulatory Guidance. Washington, DC: Office of Elementary and Secondary
Education, U.S. Department of Education. Available at:
http://www.ed.gov/admins/lead/account/saa.html#regulations.
U.S. Department of Education (2007c, December 21). Standards and Assessment Peer Review
Guidance: Information and Examples for Meeting Requirements of the No Child Left
Behind Act of 2001. Washington, DC: Office of Elementary and Secondary Education,
U.S. Department of Education. Available at:
http://www.ed.gov/policy/elsec/guid/saaprguidance.pdf
Webb, N. L. (1999). Alignment of science and mathematics standards and assessments in four
states. (Research Monograph No. 18). Madison, WI: National Institute for Science
Education.
Considerations for an AA-MAS
Page 233
Welch, C. (2006). Item and prompt development in performance testing. In S.M. Downing and
T.M Haladyna (Eds.), Handbook of test development (pp. 303 – 327). Hillsdale, NJ:
Erlbaum.
Zieky, M. (1993). Practical questions in the use of DIF statistics. In P. W. Holland and H. Wainer
(Eds.),Differential Item Functioning (pp. 337-347). Hillsdale, NJ: Erlbaum.
Considerations for an AA-MAS
Page 234
CHAPTER 7
DEVELOPING MODIFIED ACHIEVEMENT LEVEL DESCRIPTORS AND SETTING CUT
SCORES
Marianne Perie
In developing an alternate assessment based on modified achievement standards, test
developers can improve the design process by paying close attention to those modified
achievement standards. By collaborating with policymakers, they can together work through the
issues of what proficiency means for this group of students, what type of modified achievement
standards are appropriate, and how best to measure them. In fact, defining the achievement
standard is the one area where it is most important for policymakers to work directly with test
developers to ensure coherence so that what is intended as a state standard and goal is
actualized through the assessment and interpretive materials. As discussed previously, the
USED regulation requires that the modified achievement standards:
Be aligned with a state‘s academic content standards for the grade in which the student
is enrolled.
Be challenging for eligible students, but may be less difficult than grade-level academic
achievement standards.
Be developed by grade level, not grade span.
Include at least three achievement levels.
Be developed through a documented and validated standard-setting process that
includes broad stakeholder input.
The only other guidance that policymakers are given from the federal government in
defining modified achievement standards is that they are expected to represent a ―less difficult
expectation of grade-level content standards.‖ Inherent in this guidance is a tension between
ensuring the tests measure the same breadth and depth while being less difficult. This chapter
will examine how the achievement standards work within that tension to provide a less difficult
Considerations for an AA-MAS
Page 235
performance target. It will analyze the different dimensions of modified achievement standards,
describe the various components, and provide suggestions for drafting descriptors and setting
cut scores.
Defining Achievement Standards
An achievement standard defines a level of performance and includes both a minimum
cut score and a written description that distinguishes the level of performance from other
defined levels. It consists of four components: number of levels, names of levels, a descriptor
for each level, and a cut score for each level.
Numbers
According to the regulations, the number of modified achievement levels to be defined
includes a minimum of three: one distinguishing proficient performance, one above, and one
below. However, some states may want to add one or two more levels to meet the goals of their
assessment. The majority of states have four performance levels for the general assessment. It
may be desirable to mimic the structure of the general assessment by including the same
number of achievement levels for the modified assessment so that report cards and interpretive
material can be standardized as much as possible. Or, as is the case for New York, it may be
necessary to include the same number of levels to more easily incorporate them into a formula
for a performance index calculation. However, it is also an option to consider these levels
extensions of the lowest grade-level achievement standard, in which case a parallel number
would not be necessary. In this case, a state would need to consider the number of levels
necessary to convey the message intended by these modified achievement standards.
Caution is urged in developing more than four levels. As noted in Perie (2008), it can be
difficult to describe meaningful differences across more than four levels. In addition, any
particular test has a fixed amount of measurement power that depends primarily on the number
and quality of the questions in the test. ―The more cut scores there are in any given test, the
Considerations for an AA-MAS
Page 236
less measurement power the test developer can devote to each cut score, and the less
information there is around each cut score.‖ (ibid, page 17). Given the nature of the AA-MAS
and the typical rationales for developing one (i.e., school accountability), it seems unlikely that
more than four levels would be needed.
Names
Beck (2003) indicated that naming conventions should be developed as the first step in
defining performance. With modified achievement levels, the first question in naming them is
whether the names should be the same or different from the level used in the general
assessment. While it may be tempting to assign the levels the same name, state policymakers
could also consider using different names to avoid confusion and simply designate one name to
be the equivalent of ―proficient‖ for purposes of AYP. In fact, some states have received
feedback from their peer reviewers advising them to select different names for their modified
achievement levels.
Policymakers can also consider how these modified standards relate to grade-level
standards and portray that in the name. That is, if these modified achievement standards are
truly downward extensions of the grade-level achievement standards, the names should reflect
their relationship with the general assessment. For instance, some state policymakers have
considered naming the levels relative to the general assessment, such as ―not ready for the
general assessment,‖ ―almost ready for the general assessment,‖ and ―ready for the general
assessment.‖ Or, the same idea could be used to talk about achievement relative to grade-level
standards (e.g., ―near grade-level proficiency‖). Some state policymakers have also chosen not
to call any modified achievement level ―advanced‖ as they believe student performance needs
to be measured against grade-level standards before it can be called ―advanced.‖ It is important
to keep in mind that the names of the modified achievement levels often express the values of
the policymakers or the intent of the assessment.
Considerations for an AA-MAS
Page 237
Descriptors
Achievement level descriptors put into words how good is good enough. That is, they
qualitatively describe the performance expected of a student at the ―proficient‖ level or the
―basic‖ level. They must be aligned with the state academic content standards and describe
breadth and depth of the standards appropriate to the assessment, so that they represent
knowledge and skills that are evaluated by the assessment. Although the breadth and depth of
the assessment must be parallel to the general assessment, these modified descriptors may be
written to a ―less difficult level.‖ That is, while the assessment must measure similar levels of
depth of knowledge, perhaps competency of a lower depth of knowledge is all that is needed to
be proficient using the modified achievement standards.
Ideally, the descriptors will be written so that they clearly differentiate among levels and
progress logically across levels. That is, to improve articulation across levels, write the
―proficient‖ descriptor to be appropriately more rigorous than the ―basic‖ descriptor. In addition,
considering the entire assessment program will help ensure that the descriptors also progress
logically across grade levels (e.g., the descriptor for grade 5 ―proficient‖ is sufficiently more
challenging than the descriptor for grade 4 ―proficient.‖) It is important to take great care in
writing the descriptors as they drive not only the standard setting process, but also the reporting,
score interpretation, and potentially the item-writing process. In fact, many in the field claim that
the descriptors are instrumental to the validity and defensibility of the standard-setting process
(cf., Cizek & Bunch, 2007; Hambleton, 2001). More detail will be provided about this step later in
this chapter.
Cut Scores
The fourth component of achievement standards is the cut score. Cut scores define the
number of points necessary to reach each performance level. They are typically set after the
assessment has been field tested so that statistics are available to inform the process. Then
Considerations for an AA-MAS
Page 238
recommended cut scores come from a committee using any of a number of possible
methodologies to determine the best cutoff points. The regulations require the use of a
documented and validated methodology, but the choice of methods is left up to the test
developers and policymakers. Ideally, a broad range of stakeholders would be involved in the
process, typically including both special educators and content experts. It is important to fully
document the process, including a rationale for selecting a particular methodology and the
process for selecting the committee. More detail on the methods, procedures, and
documentation will be provided later in this chapter.
Defining Proficiency
The biggest issue that state policymakers will wrestle with is what proficiency means for
these students. That is, we need to determine what we mean by ―modified‖ achievement
standards. Defining the levels is an important step in standard setting. Berk (1996) discussed
the importance of providing explicit behavioral descriptions of each level, saying ―the
interpretation of the final cut scores hinge on the clarity of the behavioral definitions‖ (p. 224).
Previous chapters discussed issues related to the interaction of cognition, instruction, and
assessment and provided some insights into providing this clarity. Understanding cognition and
improving instruction can have large implications for determining what proficient means on a
given assessment. And it is here that policymakers will wrestle with making a test of similar
breadth and depth ―less difficult.‖
Taking information from the earlier chapters on how students learn the content, and
ways in which the content increases in difficulty, provides some insights into writing meaningful
descriptors. If there was one learning progression that all students followed, the task of writing
achievement level descriptors would be greatly simplified as we could simply identify points on
the learning continuum that represent ―basic,‖ ―proficient,‖ or ―advanced‖ achievement.
However, as discussed previously (see Pellegrino, Chapter 4, this volume; Pugalee &
Considerations for an AA-MAS
Page 239
Rickelman, Chapter 5, this volume), there is little to no agreement on standard learning
progressions for any population, let alone a population of students with disabilities.
With this population that may include all disability types and different learning
progressions, it will be vitally important to clearly define the population and understand why the
students are not achieving at grade level before we can describe proficient performance for
them. By considering the grain sizes (depth and breadth) of learning targets along a continuum
(Gong, 2007), instructional scaffolding that best supports how they learn, and an appropriate
level of cognitive challenge for their grade level, we can better understand achievement of these
students as compared to students without disabilities. These differences will greatly influence
the writing of PLDs.
For example, Pellegrino (Chapter 4, this volume) discusses the possibility that low
achievers may have a similar set of knowledge and skills as high achievers but may not have
cognitively organized that information as efficiently so they are not able to access it as readily.
One solution is to design a test that reduces the burden on working memory or that includes
supports to help students better organize information or more easily determine the best strategy
to solve a problem. This type of theory would need to be captured both in the test design and in
the definition of proficiency. As another example, Pugalee and Rickelman (Chapter 5, this
volume) discuss ways of modifying the domain targets systematically within each depth of
knowledge level. This approach could again be explored in both the test design and the
descriptors. Most importantly, there should be a guiding philosophy about the model of learning
for students with disabilities who are low achievers and thus eligible for this assessment. That
guiding philosophy should drive the definition of proficiency and the test design simultaneously.
As discussed in Perie, Hess, & Gong (2008), it is usually important to consider the
definition of proficiency for the 2% assessments long before standard setting, as it could drive
the design of the assessment. That is, we can work to develop items that measure the features
that policymakers have determined are important to distinguish proficient performance from
Considerations for an AA-MAS
Page 240
performance below that level. However, as discussed by Welch and Dunbar (Chapter 6, this
volume), it is also possible to modify the general assessment using statistical information
gathered from an administration of the general assessment to the target population. If, as they
suggest, a test developer takes the option of creating an AA-MAS by simply eliminating the
most difficult items proportional to the content standards, the cut score could be mapped from
the general assessment to the AA-MAS. Then, the descriptor would be modified after the fact—
focusing on the general knowledge and skills measured by the items that appear to map to each
achievement level.
One issue that several states are considering is whether the AA-MAS is at the lower end
of some continuum that includes the general assessment or whether it is a completely separate
test that measures the same content standards but to a less rigorous extent. For instance,
policymakers need to decide whether they see the AA-MAS as a stepping stone for students to
move towards grade-level achievement standards, or whether they believe that a student‘s
disability will require a different type of assessment. One implication for this decision is the
definition of proficiency. Should proficiency be defined in terms of how ready a student is to be
assessed on grade level assessments or should it be defined simply as proficient on this
separate assessment with no explicit or implicit link to performance on the general assessment?
Another, similar, consideration is how this AA-MAS fits between the AA-AAS and the
general, grade-level assessment. Most states appear to be developing an AA-MAS that is closer
in design to the general assessment than to the AA-AAS. But, how should the achievement
standards compare? One possibility is to consider proficiency as being just below proficiency on
the general assessment—that is, somewhere between ―basic‖ and ―proficient‖ performance on
the general assessment. Another possibility is to simply shift down the levels one step, so that
―proficient‖ performance on the AA-MAS will be similar in nature to ―basic‖ performance on the
general assessment. This approach is one way to keep the breadth and depth similar across the
two assessment types but making the AA-MAS ―less difficult‖ by requiring less knowledge and
Considerations for an AA-MAS
Page 241
fewer skills to reach proficiency. This type of relationship among the assessments would have
implications for the intended comparability of the assessments. (Please refer to Abedi, Chapter
8, this volume for more details.) It also has implications for the development of the modified
achievement level descriptors. In this case, the committees would start with the grade-level
descriptors for both basic and proficient, and try to write a modified proficient descriptor that falls
in between the two, or perhaps closer to the basic level.
State policymakers‘ beliefs and values also come into play as they consider whether
students who would take this assessment are capable of learning grade-level materials to the
grade-level standard. One possible theory is that these students can learn grade-level material
as well as their nondisabled peers, but they take longer to master each unit and thus do not
complete the curriculum by the end of the year. Following this theory would lead to a description
of proficiency that is similar to grade-level proficiency for material learned earlier in the year, but
requires less of students on material learned later in the school year. However, this approach
could be difficult to defend as it may violate the mandate that the breadth must remain
equivalent across the two assessments and only the difficulty may be modified. The breadth
described by the modified proficient descriptor should not be narrower than the breadth of the
grade-level proficient descriptor.
Another theory is that these students can learn grade-level material as well as their
nondisabled peers, but they require specific supports to do so. That is, the ultimate goal for
reaching proficiency may be the same, but it includes conditions. For example, the proficiency
standard may include clauses that describe the scaffolds available on the test, such as
segmenting text, providing strategies, supplying definitions, etc. Then the descriptor could
indicate that the student measured against modified achievement standard has similar
knowledge as the proficient student measured against grade-level achievement standards, but
he/she may require more supports (e.g., less vocabulary load in the test item, use of graphic
organizers to organize information before solving a problem) to demonstrate that knowledge.
Considerations for an AA-MAS
Page 242
Regardless of which theory drives the process, it is important to articulate that theory
and clearly state the inferences policymakers and educators wish to draw from the AA-MAS.
Fitting this context with the design decisions made and the definition of proficiency is central to
forming a coherent validity argument, which will be discussed more fully in Marion (Chapter 9,
this volume).
Applying Theories of Learning to Modified Achievement Level Descriptors
If state policymakers start with the perspective that the modified achievement level
descriptors are closer in nature to the grade-level achievement standards than to the alternate
achievement standards, then one strategy for drafting the descriptors is to start with the gradelevel descriptors and modify them appropriately.2 These modifications can take several forms,
depending on the theory one is following, as described in the previous paragraph.
The first question that needs to be answered is whether the knowledge and skills
required for proficiency within the modified achievement standard are the same as with the
grade-level achievement standard but with more supports and scaffolding, or the knowledge
and skills are actually different. If those drafting the descriptors believe the first description is
true, that the standards are the same but students require appropriate supports, then, they can
modify the grade-level descriptor for ―proficient‖ accordingly. For example, a grade-level
standard may state ―student is able to read a fictional text and identify key elements of the story‖
while a modified standard may state ―when the text is chunked meaningfully, the student is able
to read a fictional text and identify key elements of the story.‖ Other examples of adding
scaffolds to the descriptors include: ―A proficient student can comprehend the main message
within segmented grade-level text. With suggested reading strategies or graphic organizers,
students are able to generate and/or answer inferential questions.‖ These statements only differ
2
Note that while it is also possible to start with the alternate achievement level descriptors and modify
them to make them more difficult, this approach may be more challenging as many alternate achievement
standards do not cover all content standards and are often based on extended content standards rather
than grade-level content standards.
Considerations for an AA-MAS
Page 243
from grade-level descriptors through the addition of the scaffolds. Note that it is important to
ensure that these scaffolds are included in the test design if they are included in the descriptor.
Furthermore, these scaffolds will only be helpful on the test to the extent that they have been
used during instruction. (See Pugalee & Rickelman, Chapter 5, this volume for more information
on scaffolding.)
Other strategies for modifying descriptors apply if those designing the test believe that
the knowledge and skills required of these students should be different. First of all, under the
current federal regulations ―different‖ can only mean less difficult. There are several ways to
make grade-level achievement standards less difficult. One option is to focus on the cognitive
complexity of the requirement and reduce it appropriately. For instance, a grade-level descriptor
at grade 8 may state that a student can ―evaluate algebraic expressions‖ while the modified
descriptor could require the student to ―identify algebraic expressions.‖ Likewise, if the gradelevel descriptor says a student can solve ―two-step problems‖, a possible modification is to
require students to solve ―one-step problems.‖ For English Language Arts, we can reduce the
complexity either by reducing the depth of knowledge required (e.g., move from analyze to
describe) or qualify broader statements of knowledge. For instance, if the grade-level standard
requires students to identify various parts of speech, including ―nouns, verbs, pronouns,
adjectives, adverbs, conjunctions, and interjections,‖ one could modify that standard by reducing
the number of parts of speech required by removing the requirement of identifying conjunctions
and interjections. Both standards require students to identify broader parts of speech, but the
modified standard reduces the difficulty by only requiring students to identify simpler parts of
speech. These modifications to the descriptors make the achievement standards less difficult to
reach by reducing cognitive complexity, which complies with federal regulations as long as the
depth and breadth of the assessment itself remains similar to the general assessment.
In practice, those drafting the modified achievement level descriptors could choose to
adopt more than one of these strategies. That is, they could choose to reduce the depth of
Considerations for an AA-MAS
Page 244
knowledge required for proficiency on some of the skills, add scaffolds to the statements about
other skills, and provide specific examples to others indicating that the student is required to
perform a narrower range of the tasks than what is required in the grade-level standards, as
long as that narrower range still matches the content standards and indicators. Specifics on
writing achievement level descriptors will be discussed in the next section.
Procedures for Drafting Modified Achievement Level Descriptors
Regardless of the type of assessment, it is usually preferable to start considering
achievement level descriptors early in the test development process. In the case of the AA-MAS
(and all assessments developed under NCLB), the most important distinction is between
achieving ―proficient‖ and not, so a strong understanding of proficiency is needed. By
considering this early, test developers can start an iterative process of using the descriptors to
help design the test and then refining the descriptors as needed to match the final test
blueprint.3 When the descriptors are used to drive the test design, test developers can ensure
that the test blueprint supports the desired judgments and that the items themselves provide
opportunities for students to show what they know and can do relative to the achievement
standards. Consideration can be given to distinguishing items that would likely be answered
correctly by students who met the definition for proficiency and incorrectly by those who did not.
Our recommended approach for drafting achievement level descriptors is to involve a
committee of people who know the content and the students. However, they will always need
some direction from the policymakers regarding the intent of the assessment program. In the
case of modified achievement level descriptors, the committee will typically include both special
education teachers as well as content specialists for each subject area (e.g., reading and
mathematics). Content specialists could be subject area teachers, curriculum supervisors, or
3
If the test developer is shortening the test by eliminating the most difficult items proportional to the
content standards, then the descriptors will be considered after the test is administered. This process will
be discussed further in this section.
Considerations for an AA-MAS
Page 245
members of the general public with a specialty in that subject (e.g., a mathematician).
Approximately 5–8 participants are needed per subject area, but if descriptors are being
developed for multiple grade levels, consider inviting more participants and splitting them into
teams. That is, a group of eight participants can write the descriptors for grade 4, and then they
can separate into two groups of four with one group working on grade 3 and the other on grade
5.
The direction required from the policymakers will include the assumptions made about
the population, including who the students are and what the barriers are to their ability to
achieve grade-level proficiency on the general assessments. The committee members will also
need to understanding the theory behind the revisions and enhancements made to the
assessment as well as see examples of those revisions and enhancements. They also need to
understand the type of modifications that will NOT be permitted, such as providing below gradelevel passages on a reading assessment. In addition, if any data analyses have been done such
as those suggested in Quenemoen (Chapter 2, this volume), the committee could be informed
by concrete information about what was learned from these analyses, including specific
examples of items this population seemed to perform well on and those they did not.
Once the committee members have sufficient background, the real work drafting the
descriptors begins. The majority of those developing modified achievement level descriptors are
starting from the grade-level descriptors and editing them rather than starting new descriptors
from scratch. Regardless of the approach taken, Perie, Hess, and Gong (2008) recommend that
the committee discuss several issues, including:
Interactions of process and content (e.g., is this a routine application of skill or transfer of
known skill to a new context?)
How students move both across performance levels and across grade levels.
Considerations for an AA-MAS
Page 246
o
Are the knowledge and skills required of Proficient on the modified achievement
standard the same as on the general assessment, but some scaffolding is
needed, or are the knowledge and skills different?
o
If they are different, is the content different or the processes? (e.g., both can
make inferences at the Proficient level but the grade-level achievement
standards require that the inferences are made in a more complex context than
the modified achievement standards, or grade-level achievement standards
require students to make inferences, while modified achievement standards
require students to only draw basic conclusions from concepts presented
directly)
o
How do you see students moving across grades? For example, how does
Proficient in one grade compare to Proficient in the next?
Transition from this assessment to the general assessment – how are they linked?
o
Should the proficient level of the modified achievement standards be an
indicator of readiness for achievement on grade-level standards?
o
Should the state adopt a policy regarding the modified achievement standards,
such as students who score at the advanced level on the modified achievement
standards must take the general grade-level test the following year?
Given their answers to these questions and the theories regarding appropriate revisions
to the test design, the committees can then draft descriptors. Recall from the previous section
that some of the modifications could include: (1) reducing the cognitive complexity of the
required skill, (2) decreasing the number of elements required, or (3) adding appropriate
supports and scaffolds to the description of the knowledge and skills required. The following is
an example of a fifth-grade reading descriptor for a general assessment and the modified
version that includes all three types of modifications.
Considerations for an AA-MAS
Page 247
Grade-Level Descriptor
Proficient students comprehend the
message within grade-level text.
Using supporting details, they are
able to analyze information from the
text and summarize main ideas.
Before, during, and after reading,
students generate and/or answer
questions at the literal, inferential,
interpretive, and critical levels.
Students interpret and use
organizational patterns in text, (e.g.,
description, cause/effect,
compare/contrast, fact/opinion) in
order to gain meaning. They use
informational text features (e.g.,
index, maps, graphs, headings) to
locate information and aid in
comprehension. Students are able
to identify and analyze elements of
narrative text (e.g., characters,
setting, and plot). Additionally, Level
II students can identify author‘s
purpose and recognize how author‘s
perspective influences the text.
Considerations for an AA-MAS
Modified Descriptor
Proficient students comprehend the
message within segmented gradelevel text. Students will be able to
identify the main idea and retell
information from the passage with
supports (e.g. a web, 5 W‘s chart, T
chart), when appropriate. During
and after reading, students are able
to generate and/or answer questions
at a literal level. Students identify
and use organizational patterns in
text (e.g., sequence,
compare/contrast, fact/opinion) in
order to gain meaning. They use
informational text features (e.g.,
index, maps, graphs, charts) to
locate information and aid in
comprehension. When given
supports (e.g., story maps, character
web, illustrations), students are able
to identify basic elements of
narrative text (characters, setting,
beginning/middle/end). Additionally,
Level II students identify author‘s
purpose when given the definitions.
Page 248
Having similar structure between the grade-level and modified descriptors helps
teachers, administrators, and parents see the difference between grade-level proficiency and
modified proficiency, providing useful information in what it takes to move a student from the
alternate assessment based on modified achievement standards to the general assessment
based on grade-level achievement standards.
Earlier, a different approach to modifying descriptors was introduced, following from the
suggestion by Welch and Dunbar (Chapter 6, this volume) that an AA-MAS could be designed
by eliminating the most difficult item proportional to the content standards. This approach would
result in an AA-MAS that was similar to the general assessment in scope but shorter. The two
assessments could be statistically linked together since there are common items across both
populations. Then, the cut score could be mapped directly onto the AA-MAS. As discussed by
Welch and Dunbar (ibid), a standards validation will need to be conducted to ensure the cut
scores divide student performance meaningfully into the achievement levels. Once the cut
scores have been validated, the grade-level descriptors can be modified by taking into
consideration the items that map to each achievement level. Different rules have been used to
identify items within each level, usually focusing on the likelihood that a student within that level
would answer the item correctly compared to the likelihood of a student below that level
answering the item correctly. Items that are distinct between these two groups are identified as
mapping to that level. Then, content experts can summarize the types of knowledge and skills
represented by those items and use those summaries to write descriptors. This approach
focuses solely on the item specifications, as scaffolds have not been used in this test design.
However, caution must be taken to avoid writing descriptors that are too specific to one test
form. In addition, there would still need to be a guiding philosophy driving this approach,
including defining proficiency. The philosophy should relate to our understanding of how
reducing difficulty in this manner addresses some of the concerns about the cognitive
processing of low achievers discussed by Pellegrino (Chapter 4, this volume).
Considerations for an AA-MAS
Page 249
Regardless of what approach to writing modified descriptors is taken, the articulation
across grades should be considered. Often when committees are working on drafting modified
achievement level descriptors, they are split into smaller groups to work on specific grade
levels. If this occurs, it will be important to spend time at the end of the workshop examining the
descriptors across all grades. Articulation will be improved if the committee members are asked
to consider whether they can see a clear progression across levels and how well these
descriptors translate to instruction.
Once the modified achievement level descriptors have been drafted, they will need to be
finalized by the state department of education and then approved by the state policymaker
(typically a board of education). When the state department of education is reviewing the draft
descriptors, they typically consider them as a whole, analyzing the consistency in rigor across
grades and subjects, the natural progression of difficulty from one grade to the next, and the
alignment between the descriptors and the test blueprints.
Setting Cut Scores
At first glance, it appears that any standard setting method that a state uses for its
general assessment would work for the modified assessment, particularly since most states
appear to be starting with their general assessment and applying various types of modifications.
However, there are additional considerations that come into play when selecting an appropriate
method for setting cut scores.
Keeping in mind that there may be some state policymakers who choose to develop a
brand-new assessment or to modify their AA-GLAS, we will start with the scenario that a state
has modified the general assessment. Almost all state general assessments are comprised
primarily of multiple-choice items with some states choosing to include some open-ended items
as well. With these types of tests, a test-based approach to standard setting is typically used.
Test-based approaches are those where the judgments are made about the test itself—usually
Considerations for an AA-MAS
Page 250
about individual items—rather than about the students or their actual performance. Another way
to think about the type of methods is based on the type of judgment required. According to
Zieky, Perie, & Livingston (2008), there are four types of standard setting judgments: (1)
judgments of test questions, (2) judgments of profiles of scores, (3) judgments of people or
products, and (4) judgments of groups of people. Examples of judgments of test questions
include methods such as Angoff or Bookmark. Methods involving judgment of profiles or scores
include Dominant Profile or the Performance Profile Method. Methods that require judgments of
people or products include Contrasting Groups and Body of Work. Methods that involve
judgments of groups of people are rarely used in the educational context.
While any of the three prominent types of judgments could apply to an AA-MAS, the
methods most appropriate for a test that is primarily comprised of multiple-choice items with a
few (or no) open-ended items include judgments of test items. This section will focus on the two
most common test-based methods—Angoff and Bookmark—and then discuss the feasibility of
using methods based on judgments of profiles, people, or products.
Test-Based Approach
Test-based approaches typically require standard-setting committees to make
judgments about test items. The two most commonly used methods for K–12 educational
assessments are the modified Angoff method and item mapping, typically the Bookmark
method. The applications of these two methods to set cut scores on the AA-MAS will be
discussed in this section.
Modified Angoff. The modified Angoff method (Angoff, 1971) is probably the most widely
used and best researched standard-setting method. In it, participants are asked to state the
probability that a borderline test taker (e.g., someone who is just barely proficient) would answer
each test item correctly. Summing the probabilities across all test items provides the test score for
a borderline test taker, which becomes the cut score for that achievement level. Typically, for a
Considerations for an AA-MAS
Page 251
multiple-choice test with four response options, we recommend that panelists limit their judgments
of probability to a range of 0.25 to 0.95. The reasoning is that even if the student has minimal
ability to answer the item correctly, he will have a 25% probability of answering it correctly by
chance (1-in-4). We limit the upper end to show that we never expect perfection from a student.
The only exception that panelists are given is if they think that one distractor will be so appealing
to a student with minimal knowledge that he is likely to be drawn to that distractor to the point that
he has a less than 25% chance of answering the item correctly, then they can provide a rating
below 0.25.
Now consider an AA-MAS where the revisions have included reducing the answer options
from four to three. In this situation, the student has a 1-in-3 (33.3%) probability of answering the
item correctly by chance, further restricting the range of possible judgments to 0.35–0.95. This
adjustment will almost certainly result in a higher cut score, which may not be desirable.
Another option for states wanting to stick to a modified Angoff approach is to use another
modification of the Angoff method—the yes/no method (Impara & Plake, 1997). In this option, the
judgment would be a simpler yes/no that the borderline test taker either would or would not
answer this item correctly. There have been some concerns raised that the yes/no method rounds
judgments too inaccurately (c.f, Reckase, 2006; Zieky, et al, 2008). For instance, a panelist who
feels that a borderline test taker who has a 25% chance of answering the item correctly would
record a 0. He would also likely record a 0 for an item he thought the borderline test taker had a
45% probability of answering correctly and another 0 for an item he thought a borderline test taker
had a 40%% chance of answering correctly, resulting in a cut score of 0 out of 3, whereas the
traditional Angoff would calculate a cut score of 1 out of 3. Thus, it would be reasonable to
consider adding in a guessing factor.
For example, if on a 50-item test a group of panelists agrees that the borderline Proficient
student would answer 23 items correctly, then the unadjusted raw cut score would be 23 out of 50
points. However, to adjust for guessing, we could then assume that of the remaining 27 items that
Considerations for an AA-MAS
Page 252
the student does not have the ability to answer correctly, they would answer 1/3 of them correctly
by guessing (assuming 3-option answer choices). Therefore, they would answer 23 items
correctly through their ability and 9 items correctly by chance, making the adjusted cut score 32
points out of 50. 4 This raw score cut can then be transformed to a scale score cut if desired.
Note that no change would be needed for applying an Angoff methodology to an openended item on an AA-MAS. The method most commonly used in K-12 assessments for the openended items is the mean estimate method, where the panelists estimate the mean (or average)
score a roomful of 100 borderline test takers would achieve. Those averages are then added to
the probabilities for the multiple-choice items (which are, in fact, averages of 0/1 scores) or to the
sum of 0s and 1s. Modification should not affect a panelist‘s ability to make this type of judgment
and no adjustment for guessing would be needed.
Item Mapping. Item Mapping approaches include Item Descriptor Matching (Ferrara,
Perie, & Johnson, 2008) and the more commonly used Bookmark method (Mitzel et al., 2001).
The Bookmark method was developed to be used with tests that are scored using Item Response
Theory (IRT). It is now one of the most widely used cut score-setting methods for state K–12
assessments. To use this method as it was designed, the state will need a test that was calibrated
using IRT and be able to order the items from easiest to most difficult based on the calibrations.
The panelist uses an ―Ordered Item Booklet‖ that displays the questions in order of difficulty
from easy to hard and is asked to place a bookmark at the spot that separates the test items
into two groups—a group of easier items that the borderline test taker would probably answer
correctly (with a response probability of 67, meaning a chance of at least 2 out of 3 or .67), and
a group of harder items that the borderline test taker would probably not answer correctly (i.e.,
4
This adjustment could result in a cut score higher than the panelist intended if they are not confident in
their judgment of the 1s. They should be instructed to record a 1 only if they feel the borderline test taker
would have a strong probability of answering this item correctly. Another option would be to substitute the
1s and 0s with probabilities before summing the judgments to calculate a cut score. For instance, the 0s
could be transformed to 0.33 and the 1s could be transformed to 0.95.
Considerations for an AA-MAS
Page 253
the test taker would have a probability of less than .67 of answering correctly). The bookmark
placement is then translated to an ability level of a student who has at least a .67 probability of
answer the items before the bookmark correctly and a less than .67 probability of answering
correctly the items after the bookmark. That ability level (or theta value) can be translated to a
scale score and mapped back to a raw score.
A concern with using this (or any item-mapping) method on an AA-MAS is in the item
ordering. Typically, an ordered-item booklet reflects a large population of students with a wide
degree of variance in their abilities. While their may be some ―distance‖ in the associated theta
values at extreme ends of the booklet, the majority of items are close enough together that it is
a fairly simple transformation to map a bookmark placement to an ability score. However, some
states have experienced difficulties with an ordered item booklet of a AA-MAS, where there was
not as much variation among test takers resulting in some clumping of item difficulties and areas
with large gaps in ability scores between the clumps.
For instance, let‘s supposed that in a traditional Bookmark item map, items 10–16 have
associate theta values of 1.02, 1.04, 1.05, 1.05, 1.07, 1.08, and 1.10. Although there are
different methods for selecting the actual cut point (theta value of the item that is bookmarked,
the theta value of the item before it, or the mean of those two values), it is relatively
straightforward to determine the cut score value for a bookmark that is placed at any of those
items. But what if the items had theta values of 1.02, 1.02, 1.03, 1.42, 1.42, 1.43, and 1.67? If
the bookmark is placed on item 13 (the fourth value in the string) indicating that the 13th item is
the first one that a borderline test taker would not have a 0.67 probability of answering correctly,
what should the cut score be? Given the three methods usually used to determine the cut score,
this one cut score could be assigned a value of 1.42, 1.03, or 1.225. These are fairly disparate
numbers and could result in very different scale score or raw score cuts.
Considerations for an AA-MAS
Page 254
Therefore, before choosing to use an item-mapping approach, it is important to consider
the size and variance of the population taking the AA-MAS. That is, be sure that there are
enough students taking the test and enough variance in that population of students for the items
to both scale well and order sensibly. Theoretically, it may be more feasible for a state the size
of Texas to use an item-mapping approach to set cut scores on the AA-MAS than a state the
size of Delaware.
An alternative for states who are worried that their samples are too small or too
homogeneous is to vary a traditional item-mapping approach using classical measurement
theory rather than IRT. Actually, the process described here is similar to a yes/no Angoff except
that the items are ordered by difficulty, as in a traditional item-mapping approach.
The approach involves ordering the items and placing them into an ordered item booklet,
as in the Bookmark approach; however, p-values rather than IRT difficulties are used to
determine the order. Then, ask the panelists to start with the easiest item and simply ask ―would
a borderline Proficient student be able to answer this item correctly?‖ If the answer is yes, then
they move to the next item. When they reach an item that they answer ―no‖ to, that is where
they place their bookmark. As with all Bookmark procedures, we recommend that the panelists
continue a little further into the booklet to ensure that the bookmarked item is truly the beginning
of the more difficult items and not an anomaly. Then, rather than transforming the bookmark to a
difficulty estimate, simply count the number of items before the bookmark and use that number
as the initial raw score cut. For instance, in a 50-item booklet, if the panelist places their
bookmark on item 22, then the initial cut score would be set at 22 out of 50 raw score points.
Again, it is worth adjusting this cut score for guessing. If this booklet contained only multiplechoice items with 4-option answers, then a borderline test taker would have a 1-in-4 chance of
Considerations for an AA-MAS
Page 255
answering the remaining 28 items correctly by guessing. So, we would add 7 raw score points
to the cut score for a final cut score of 29 out of 50 points.5
Other Standard-Setting Approaches
As mentioned earlier, there are other standard-setting approaches that may be worth
considering, particularly if the test design includes more than multiple-choice items. At least one
state is developing an AA-MAS that involves collecting student evidence on each content
standard assessed. The result will look more like a portfolio assessment than a traditional
paper-and-pencil assessment. So, it is important to consider other standard-setting methods for
these alternate approaches. Three methods we will discuss here are the Body of Work
(Kingston, Kahl, Sweeney, & Bay, 2001), Analytic Judgment (Plake & Hambleton, 2001),
Dominant Profile (Plake, Hambleton, & Jaeger, 1997), and Contrasting Group (Livingston &
Zieky, 1982) methods.
Body of Work. The Body of Work method falls under the category of judgments of people
and products and requires some type of evidence for the panelists to consider. Zieky, et al.
(2008) lists this as a type of Contrasting Groups approach that focuses on categorizing student
work rather than the students themselves. The method is designed for tests with performance
tasks or tasks that yield observable products of a student‘s work, such as essays or recorded
speech or science experiments. This is a popular method for the portfolios often used for the
AA-AAS. It would also be suitable for a design that requires students to submit evidence of
achievement for each assessed content standard. The method does not work well for tests that
5
Note that if the booklet contained open-ended items, they could not be answered correctly by
chance and would not be figured into the adjustment. For instance, if 8 of the 28 remaining ―items‖ in the
booklet represented various point values for open-ended items, we would simply calculate the probability
of guessing correctly on the 20 multiple-choice items, adding 5 points to the initial raw score cut.
Considerations for an AA-MAS
Page 256
include large numbers of multiple-choice questions, but it will work if there are a few multiplechoice questions with the performance tasks.
The panelists are asked to review a full body of evidence (meaning the responses to all
test questions) and make a single judgment about the entire set of responses, matching the
knowledge and skills exhibited in the responses to the knowledge and skills required to be in an
achievement level. The cut score between two performance levels is chosen by finding the point
on the score scale that best distinguishes between the sets of evidence placed in each of the
achievement levels.
Analytic Judgment. The Analytic Judgment method is also a method where judgments
are made on products; however, judgments are made on responses to individual items (or
groups of related items) rather than on the product as a whole. It was designed to be used with
tests made of several essay or performance tasks. The method will work for tests that include
some multiple-choice items with the performance tasks as long as the items can be grouped
into meaningful content clusters.
The Analytic Judgment method begins by asking panelists to review samples of test
takers‘ work. As described in Zieky, et al. (2008), it is similar to the Body of Work method, but
there are two distinct differences:
1. Panelists make judgments on test takers‘ responses to individual items or to clusters of
related items rather than to the entire body of evidence at once, and
2. In addition to classifying a response into an achievement level, panelists further classify
the responses at each performance level into low, middle, and high categories. For
example, a response is not simply classified as Proficient. It is, in addition, classified as
low Proficient, middle Proficient, or high Proficient.
The result is a cut score for each item or group of related items; this cut score is the
score that most clearly distinguishes between the best responses in the lower achievement level
and the worst responses in the higher achievement level (e.g., between responses classified as
Considerations for an AA-MAS
Page 257
high Basic and low Proficient.) Those are the responses that are close to the borderline of each
achievement level. The cut scores for all items or all groups of items are summed to get the cut
score for the total test.
Dominant Profile. The Dominant Profile is a method based on profiles of scores and
typically results in a conjunctive cut score. That is, the test is divided into meaningful parts that
measure different knowledge and skills, and a cut score is determined for each part separately.
Thus the outcome is not a single cut score but a set of rules. Those rules can specify separate
cut scores for each content strand, or there can be a single cut score for the total score with a
minimum score on certain components.
The panelists‘ task is to become familiar with the test, how it is scored, and with the
meanings of the different strands/components. They then work together to specify rules for
determining which combinations of scores represent acceptable performance and which do not.
The rules can combine information from the scores of different components in various ways, as
in the following example:
A mathematics test is divided into 5 strands with 20 points per strand. The panelists
determine the follow set of rules to be used before classifying a student as Proficient:
No score below 10 on any component
At least one score of 15 or higher
A total score of at least 60 points
Contrasting Groups. Finally, what if a test developer is in a position where cut scores
need to be set, but the data are not yet available, and there is no rubric or student work to
analyze? The original contrasting groups method involves judgments about test takers (Zieky &
Livingston, 1982). The judgments can be made prior to the test administration and then
compared to the actual scores received to calculate the cut score. The method involves
identifying teachers familiar with the target population and then training them on the meaning of
the achievement level descriptors, paying particular attention to differentiating between high
Considerations for an AA-MAS
Page 258
performance on the lower level and low performance on the higher level. This training does not
have to be done in person (videotapes work well), as the method typically works best when
there are large numbers of teachers involved (at least 100 per cut score). Once the teachers
have been trained, the test developer asks them to place each of their students who will be
taking the AA-MAS into one of the achievement levels based on their experience with those
students. Once the students have taken the test, they are assigned a total score (either a raw
score or a scale score will work for this method). Then, the distribution of scores across
assigned achievement levels can be examined to determine the best cut score for each level.
For instance, for each cut score, the percentage of students scoring at the higher of the two
levels can be plotted against the score. That is, for the basic/proficient cut, plot a graph with the
range in total cut scores along the x-axis, and the percentage of students at each of those levels
categorized as Proficient by their teachers on the y-axis. Then, choose the cut score for
proficient based on the percentages. Zieky, et al. (2008) recommends that ―one reasonable
choice for a cutscore would be the score at which 50 percent of the test takers are [categorized
as] Proficient because that would represent the borderline of the Proficient performance level
(page 78).‖ Another procedure is to plot the distributions of scores for two adjacent levels (e.g.,
basic and proficient) and set the cut score at the point at which two distributions overlap.
Because this method is based on the judgments of teachers about students they know, it
is a reasonable way to match students to achievement levels, but it also introduces some bias.
Teachers may factor other considerations into their judgments, such as effort and likability,
when the judgment should truly be about the student‘s knowledge, skills, and ability. This
method is often used to check a cut score set through a method based on judgments of test
items. This check can be done a couple of years after the initial standard-setting workshop once
teachers have become very familiar with both the test and the meaning of the achievement
levels.
Considerations for an AA-MAS
Page 259
Linking Tests through Cut Scores
A final option for consideration is linking the AA-MAS to the general assessment
through the cut scores. Although this idea will be discussed more thoroughly in Abedi (Chapter
9, this volume), it is worth introducing here. Some state policymakers have suggested linking
the ―advanced‖ or ―proficient‖ level of the modified achievement levels to the ―basic‖ level of the
general grade-level achievement levels. One option would be to link the assessments
statistically with common items taken by both populations (as described by Welch & Dunbar,
Chapter 6, this volume), but another option is to link the assessments judgmentally.
A judgmental linking is where a standard setting method is applied to make the
―advanced‖ level of one test equivalent to the ―basic‖ level of another. There are several ways to
do this, but the best is to use many of the same panelists in both standard settings. Start by
having the panelists become thoroughly familiar with the ―basic‖ level of the general
assessment, both by reviewing the grade-level achievement level descriptor and by examining
exemplar items and/or student work at that level. Then, the modified achievement level
descriptor for advanced (or proficient) would need to be matched to the grade-level
achievement level descriptor for basic. Preferably, the descriptors would be exactly the same,
with only slight modifications to allow for the use of the scaffolds that may have been built into
the assessment. The judgmental task most commonly used is an item mapping approach where
the panelists would work through an ordered item booklet to find the cut score that would allow
for the same interpretation of knowledge and skills across the two assessments.
Final Considerations
Although the greatest challenges for developing modified achievement standards lie in
defining proficiency for this population and applying an appropriate standard-setting
methodology to set a cut score, we would be remiss if we did not discuss the importance of
documentation and validity studies. Proper documentation is important for any testing program
Considerations for an AA-MAS
Page 260
and mandated by the peer review guidance. Likewise, we should always be thinking of the
validity of the interpretations made using the achievement standards, and peer review requires
plans for validating the assessment and the inferences made from the results.
Documentation
It is important to document both the process of developing modified achievement level
descriptors and the standard-setting procedures. Two professional Standards (AERA, APA, &
NCME, 1999) directly address the importance of documenting the rationale, procedures, and
results:
Document the PLDs, selection of panelists, training provided, ratings, and variance
measures (Standard 1.7)
Document the rationale and procedures for the methodology used (Standard 4.19)
As discussed in Perie (2007), there are eight important components that need to be
documented regarding the standard-setting process:
1. Achievement level descriptors
2. Panelists
3. Rationale
4. Training
5. Procedures
6. Ratings and variance
7. Any adjustments and adoption of cut scores
8. Validity evaluation
Most of these are fairly straightforward and discussed in several texts on standard
setting. Here, we will highlight only two areas that may have particular sensitivities for modified
achievement standards and have been discussed within this chapter.
Considerations for an AA-MAS
Page 261
Achievement Level Descriptors. Because of the challenges associated with describing
proficiency using a modified achievement standard, it is vital that the test developer both
describe and justify the process used, including the selection of participants who may have
drafted the descriptors, the directions given to them, the data or information used to inform the
process, and the number and type of reviews conducted before the descriptors were formally
adopted. Providing a theory of who the students and what the barriers are to their achieving
grade-level standards will aid the understanding of how the descriptors were developed.
Rationale. It will be important to document the rationale for selecting the standard setting
method used to set the cut scores. If the revisions or enhancements made to the assessments
(e.g., the reduction of a response option) or the characteristics of the population (e.g., small
variance in performance) affected the choice of available methods, this could be explained in
writing to better help a reader understand the purpose and logic. Explaining the rationale behind
any selection of a process helps inform the validity argument as discussed in the next section
and in Marion (Chapter 8, this volume). Finally, if any modifications to the traditional application
of the standard-setting method—such as those described in this chapter—were made, these
need to be documented as well along with the rationale for these modifications.
Validation
Validity is a large topic that will be covered more completely in Marion (Chapter 8, this
volume), but it is worth touching on the various types of evidence that can be collected during
the standard-setting process here. Collecting the information discussed in the documentation
section can provide evidence of internal validity of the achievement standards. Providing a
rationale for the methods used, ensuring an appropriate panel composition, comparing the
results to other external sources can all provide validity evidence to the argument that the
achievement standards were set appropriately. Then, thought needs to be given to how the
Considerations for an AA-MAS
Page 262
interpretation and use of the achievement standards contribute to the consequential validity of
the assessment.
In examining the validity of the use of the achievement standards, it is important to ask a
series of questions about the basic components of those standards. Conducting a series of
studies over the first several years of the assessment can provide information to answer these
questions on issues regarding appropriateness of the modified achievement level descriptors
and the accuracy of the cut scores. For example, questions may include:
Was the standard-setting procedure internally valid?
Do the cut scores divide students reasonably in terms of achievement?
How well does the test classify students compared to their achievement in the
classroom?
Do the effects of the achievement standards match what was intended?
Have the modified achievement level descriptors had an impact on instruction?
Have there been any negative consequences of using these achievement standards?
Some of these questions can be answered through the standard-setting process itself. It
will be important to show that the panelists were qualified and representative of all possible
panelists. Evaluation forms can be used to show that the panelists understood the process and
were confident in the results. If feasible, working with two separate panels during the standardsetting process will also provide a measure of consistency in cut score recommendations and
provide evidence of validity. To argue for the reasonableness of the cut scores, the test
developer can compare the percentage of students categorized into each achievement level by
the AA-MAS to the percentages in the equivalent categories by the general assessment and the
AA-AAS. If all tests are intended to be developed to the same rigor for their specific populations,
then one would expect the impact data to be distributed similarly across all assessments.
Other questions can be answered through teacher surveys and focus groups as well as
classroom observations. Conducting a contrasting groups study a year after the cut scores are
Considerations for an AA-MAS
Page 263
set can also provide useful interpretative information. Once the test developers are confident
that teachers know and understand the modified achievement level descriptors, they could ask
the teachers to classify their students into one of the four achievement levels prior to the
assessment. Then the classifications determined by the assessment could be compared to the
teacher classifications to see if the teachers would generally assign students into higher or
lower categories or if the two sources of data provide similar classifications.
As a final thought, it is important to keep in mind that the process of setting achievement
standards does not end with the cut score study or even with State Board approval of the
descriptors and cut scores. Instead, consider designing a mechanism within an assessment
program to continually monitor the effectiveness and appropriateness of the achievement level
descriptors and the usefulness of the categories as defined by the cut scores. Particularly for
this population where we expect instruction to continually improve and move closer to gradelevel instruction, it is important to frequently monitor the efficacy of the modified achievement
level standards.
Considerations for an AA-MAS
Page 264
References
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.)
Educational measurement (second edition, pp. 508–600). Washington, DC: American
Council on Education.
Beck, M. (2003, April). Standard setting: If it is science, it‘s sociology and linguistics, not
psychometrics. Paper presented at the annual meeting of the National Council on
Measurement in Education, Chicago, IL.
Berk, R. A. (1996). Standard setting: The next generation (where few psychometricians have
gone before!). Applied Measurement in Education, 9 (3), 215–235.
Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating
performance standards on tests. Thousand Oaks, CA: Sage.
Ferrara, S., Perie, M., Johnson, E. (2008). Matching the judgmental task with standard setting
panelist expertise: The Item-Descriptor (ID) Matching procedure. Journal of Applied
Testing Technology, 9(1).
Gong, B. (2007). Learning Progressions: Sources and Implications for Assessment. Presentation
at the CCSSO Large-Scale Assessment Conference, Nashville, TN, June 2007.
Hambleton, R. H. (2001). Setting performance standards on educational assessments and
criteria for evaluating the process. In G. J. Cizek (Ed.) Setting performance standards:
Concepts, methods, and perspectives (pp. 89–116). Mahwah, NJ: Lawrence Erlbaum.
Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of
Educational Measurement, 34(4), 353–366.
Kingston, N. M., Kahl, S. R., Sweeney, K. P., & Bay, L. (2001). Setting performance standards
using the body of work method. In G.J. Cizek (Ed.), Setting performance standards:
Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum Associates.
Livingston, S., & Zieky, M. (1982). Passing scores: A manual for setting standards of
performance on educational and occupational tests. Princeton, NJ: ETS.
Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark procedure:
Psychological perspectives. In G.J. Cizek (Ed.) Setting performance standards:
Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum Associates.
Perie, M., (2008). A guide to understanding and developing performance level descriptors.
Educational Measurement: Issues and Practice, 27(4) pp. 15-29.
Perie, M. (2007). Setting alternate achievement standards. Lexington, KY: University of
Kentucky, Human Development Institute, National Alternate Assessment Center.
Available online at: http://www.naacpartners.org/products/whitePapers/18020.pdf
Considerations for an AA-MAS
Page 265
Perie, M., Hess, K., & Gong, B. (2008). Writing Performance Level Descriptors: Applying
Lessons Learned from the General Assessment to the 1% and 2% Assessments. Dover,
NH: National Center for the Improvement of Educational Assessment. Available at
www.nciea.org.
Plake, B. S., & Hambleton, R. K. (2001). The Analytic Judgment method for setting standards
on complex performance assessments. In G. J. Cizek (Ed.), Standard setting: Concepts,
methods, and perspectives (pp. 283–312). Mahwah, NJ: Erlbaum.
Plake, B. S., Hambleton, R. K., & Jaeger, R. M. (1997). A new standard setting method for
performance assessments: The Dominant Profile Judgment method and some field-test
results. Educational and Psychological Measurement, 57, 400–411.
Reckase, M. D. (2006). A conceptual framework for a psychometric theory for standard setting
with examples of its use for evaluating the functioning of two standard setting methods.
Educational Measurement: Issues and Practice , 25 (2), 4–18.
U.S. Department of Education. (2007). Final Rule 34 CFR Parts 200 and 300: Title I—Improving
the academic achievement of the disadvantaged; Individuals with disabilities education
act (IDEA). Federal Register. 72(67), Washington DC: Author. Available at:
http://cehd.umn.edu/NCEO/2percentReg/Federal-RegApril9TwoPercent.pdf
Zieky, M., Perie, M., & Livingston, S. (2008). Cutscores: A manual for setting performance
standards on educational and occupational tests. Princeton, NJ: Educational Testing
Service.
Considerations for an AA-MAS
Page 266
SECTION III
TECHNICAL CONSIDERATIONS AND PRACTICAL APPLICATIONS
This final section incorporates the overarching themes of comparability and validity of
these assessments and then focuses on how the AA-MAS will fit into a state accountability
system. Its original intent was to provide information on examining the technical adequacy of
these assessments as a logical follower of the section on assessment design. However, it soon
became clear that specific issues needed to be addressed, far beyond most technical
considerations of item analysis, reliability and equating. Thus, a previous chapter (Welch &
Dunbar, Chapter 6, this volume) began the discussion of technical adequacy, focusing on item
analyses and psychometric characteristics of the test. This section, then, focuses on very
specific questions regarding the technical quality of this assessment, not as a standalone
assessment, but as it fits into a larger assessment program.
Chapter 8, by Jamal Abedi discusses issues related to the comparability of the AA-MAS
from the perspective of ensuring students who take this assessment should have the same
opportunities for success and inclusion as students who take the general assessment. Several
components of comparability are examined, including content and construct, psychometrics,
scale and score, linguistic structure, basic text features, depth of knowledge and
accommodations used for students with disabilities based on their IEP.
Chapter 9, by Scott Marion discusses the importance of developing a validity argument
for the implementation and use of the AA-MAS. He emphasizes the importance of articulating
the theory of action particularly in light of the uncertain conceptual framework supporting this
AA-MAS initiative. He then describes methods for evaluating the argument to provide
information about how to improve the program and how to determine the value of AA-MAS in
terms of the instructional and social benefits given the costs.
Considerations for an AA-MAS
Page 267
Finally, in Chapter 10 by Chris Domaleski, the focus turns to the practical application of
these ideas in a state assessment system. He describes issues of fitting this assessment into a
pre-existing state assessment and accountability system. He considers the current state context
in reviewing operation considerations and discusses ways to estimate reliability, produce
informative score reports, and consider options related to diploma eligibility.
The authors in this section received advice and guiding comments from the expert panel
members who reviewed these chapters. In particular, comments from Katherine Ryan, Phoebe
Winter, Brian Gong, Suzanne Lane, and Howard Everson were valuable and served to inform
the final drafts of the chapters.
Considerations for an AA-MAS
Page 268
CHAPTER 8
COMPARABILITY ISSUES IN THE ALTERNATE ASSESSMENT BASED ON MODIFIED
ACHIEVEMENT STANDARDS FOR STUDENTS WITH DISABILITIES
Jamal Abedi
The mandate of including students with disabilities in state and national assessments
may not produce desirable results if the assessment outcomes for these students are not
comparable with the assessment outcomes for mainstream students. Thus, comparability issues
for these students must be given careful attention if these students are to be given a fair chance
of inclusion in the assessment and accountability system. The principle of comparability and its
related issues have long been debated. In this chapter, the concept of comparability is viewed
and discussed in broader terms and from different perspectives, including content and
construct, psychometrics, scale and score, linguistic structure, basic text features, depth of
knowledge, and accommodations used for students with disabilities based on their IEP. It is
indicated that comparability is not an ―all-or-none‖ proposition; rather it is a continuum of varying
degrees. Recommendations have been provided for the State of New York on how to view and
evaluate comparability between alternate assessments based on modified achievement
standards (AA-MAS) and general assessments.
Rationale
Recent legislation such as the reauthorization of the Individuals with Disabilities
Education Act and the No Child Left Behind Act of 2001 mandates inclusion of students with
disabilities in the assessment and accountability systems (Domaleski, Chapter 10, this volume;
Gong & Blank, 2002; Lowrey, et al, 2009; Thompson, Lazarus, Clapper & Thurlow, 2006). This
mandate is based on the assumption that the same, or at least comparable, assessments are
used across groups of students, those with different types of disabilities and those without any
apparent disabilities. In the context of assessment, comparability means that the inferences
from the scores on one test can be psychometrically related to a score on another ―comparable‖
Considerations for an AA-MAS
Page 269
test (Marion, 2006). In other words, comparability assumes equivalence between the
assessments (Elosua & Lopez-Jhuregui, 2008). While these definitions provide one aspect of
comparability, they emphasize the importance of comparability of AA-MAS with the general
assessments as the policy of inclusion may not produce valid outcomes if the assessments
used for different subgroups of students do not have the same meaning and do not lead to the
same interpretation across these subgroups. In this chapter, issues concerning comparability of
assessments for students with disabilities taking alternate assessment based on modified
achievement standards (AA-MAS) are discussed and methods for examining such comparability
are described. The focus on comparability in this chapter centers on the application of AA-MAS
assessments for students with disabilities in the State of New York.
The majority of students with disabilities take the general state assessments, with or
without accommodations. However, a small group of students with disabilities—who can make
significant academic progress but who are not able to achieve grade-level progress—may not
be able to show the full range of their knowledge and skills on the general assessments even
with accommodations. Therefore, they are offered alternate assessments (Lazarus, Rogers,
Cormier & Thurlow, 2008). These alternate assessments have been described as the ―ultimate
accommodation‖ for inclusion of students with significant disabilities in the accountability system
(Domaleski, Chapter 10, this volume; Roach, 2005).
However, there are major questions and concerns regarding the purpose, design,
development, implementation, and interpretation of the outcomes of these assessments. For
example, Kettler and Almond (2009) raise many questions regarding these assessments:
―First and foremost, which students should be eligible for an AA-MAS? Second, what
are their unique learning characteristics, and how should an assessment be tailored
to their needs based on a better understanding of their cognitive processing?‖ (p. 5)
The authors also raised questions related to item and test development which include:
Considerations for an AA-MAS
Page 270
―(a) What characteristics make an item or test more accessible? (b) How might
changes in test delivery and format interact with altered items? (c) At what point does
an alteration to an item affect the construct being measured? (d) How is alignment to
the content standards affected by item and test alterations? (e) How do proficiencylevel descriptions affect the development of AA-MAS? (f) What criteria should be
used to judge student success? (g) How do alterations designed to change the
complexity and difficulty of items affect the technical quality of AA-MAS as complete
tests?‖ (ibid, p. 5)
There are also major issues with the standard settings for AA-MAS used for students
with disabilities for the 2% student group. For example, how comparable should the cut scores
be set for different performance level and how should these cut scores be defined? (Olson,
Mead, & Payne, 2002). Answers to these questions require substantial efforts in conducting
research in the area of alternate assessments for students with disabilities.
Different forms of alternate assessments have been proposed. Among them are: 1)
alternate assessments based on grade-level achievement standards (AA-GLAS), 2) alternate
assessments based on alternate achievement standards (AA-AAS), which are usually referred
to as the 1% group, and 3) alternate assessments based on modified achievement standards
(AA-MAS), often referred to as the 2% group (Gong, 2007). Elliot and Roach (2007) underscore
the importance of determining effective strategies for including special needs students in the
overall accountability for student achievement, stating:
―Alternate assessments are used with a relatively small population of students with
disabilities, yet demand a significant amount of time from educators and state
assessment professionals to develop, implement, and evaluate. It appears the efforts
of these professionals will need to be extended given the vast majority of states‘
have not met the USDOE‘s requirements for alignment and technical soundness.‖
(pp. 330-1)
Considerations for an AA-MAS
Page 271
Different chapters in this volume address some of the issues raised above. For example,
Quenemoen (Chapter 2, this volume) discusses eligibility criteria for students taking AA-MAS.
She distinguishes between low-performing students who have disabilities and those with no
apparent disabilities. The chapter by Welch and Dunbar (Chapter 6, this volume) discusses
issues concerning the development of AA-MAS and the advantages and disadvantages of
various options for modifications, and this chapter focuses on the comparability aspect of AAMAS.
Challenges in Evaluating Comparability
Developing alternate assessments for students with disabilities is quite complex and
requires special attention and planning. For example, Lowery et al. (2009) suggest ―adherence
to the requirement to maintain an individualized, meaningful curriculum for students with severe
disabilities complicates delivery of an assessment that is created to measure progress of
students toward a standardized curriculum‖ (page 250). The authors indicate that the use of
different approaches by states (e.g. such as simplifying general education standards, redefining
them as functional skills, or extending them through the use of foundational skills) brings further
complications in the process of developing alternate assessments. Roach (2005) discusses and
examines four challenges in designing and implementing alternate assessments for students
with significant disabilities. These challenges include: 1) deciding who should participate in
alternate assessments, 2) determining the content area that alternate assessments should
measure, 3) creating reliable and valid alternate assessments, and 4) defining proficient
performance on alternate assessments. While some of these challenges (such as challenge 2
which only applies to AA-AAS) may not apply to AA-MAS, but they emphasize the difficulty in
developing these assessments and interpreting their scores.
The U.S. Department of Education (USED) announced the proposed regulation for the
AA-MAS in 2005 (Kettler & Almond, 2009). Subsequently, USED provided Peer Review
Considerations for an AA-MAS
Page 272
Guidelines for conducting reviews of state assessment systems including alternate
assessments based on modified achievement standards (U.S. Department of Education,
December 2007). Based on one USED report (U.S. Department of Education, November 2008),
eight states have developed AA-MAS for at least one grade level. The report by the National
Technical Advisory Council indicated that ―seven of these states have submitted evidence to the
Department for peer review but none has met all the requirements‖ (page 4). One of the main
reasons that states were not able to provide sufficient evidence on the comparability of
assessments for AA-MAS is that some of the students who face the most challenges in their
educational careers belong to subgroups that are small in size. It would be extremely difficult for
researchers to examine the factors affecting comparability between AA-MAS and general
assessments using traditional research/psychometric methodologies due to such small group
sizes. In order to do a comparison between students who take general assessments with those
taking alternate assessments, large enough samples are needed in order to detect meaningful
differences.
In some categories of low incidence disabilities, there are hardly enough subjects in a
school, district, or even in most states to allow for meaningful analyses of data to examine
comparability issues. In such cases, researchers may be required to combine some of these
categories in order to obtain a large enough sample to conduct studies that are
methodologically sound. However, research suggests that issues concerning assessment of
students with disabilities might vary across different categories of disabilities; therefore, it may
not be reasonable to aggregate findings from students in the different subgroups of disabilities
(see, for example, Abedi, Leon & Kao, 2008).
The Standards for Educational and Psychological Testing (American Educational
Research Association [AERA], American Psychological Association [APA], & National Council
on Measurement in Education [NCME], 1999) view comparability as a major foundation
underlying valid and fair assessments and allocate an entire chapter to issues regarding
Considerations for an AA-MAS
Page 273
comparability (Chapter 4, pp. 49–60). However, the main focus of the Standards’ Chapter 4 is
on score comparability, which can be established through approaches such as scaling test
scores. The Standards state (p. 49): ―Scale scores may aid interpretation by indicating how a
given score compares to those of other test takers by enhancing the comparability of scores
obtained using different forms of a test, or in other ways.‖
However, the Standards acknowledge the limitations on the score comparability,
particularly when implemented in terms of cut score6 (p.50): ―Criterion-referenced interpretations
based on cut scores are sometimes criticized on the grounds that there is very rarely a sharp
distinction of any kind between those just below versus just above a cut score.‖
Approaches to Comparability
In order to address the current need and develop strategies to overcome the challenges,
the discussion of comparability in this chapter goes beyond the traditional approach including
those discussed in the Standards. In addition to score comparability, the chapter discusses
comparability in several other areas. Specifically, this study proposes comparability in six major
areas: (a) content and construct, (b) depth of knowledge, (c) accommodation, (d)
psychometrics, (e) linguistic structure, and (f) basic text features. However, we acknowledge the
challenging task of establishing comparability in all six areas. Therefore, we group these
comparability features into two categories, (1) required comparability (features ―a‖ through ―c‖),
and (2) complementary or desired comparability (features ―d‖ through ―f‖). To assure
comparability, the test developers must present evidence on the first category, and if feasible,
with supplemental (preferred) evidence from the second category. This recommendation of two
broad categories is based on literature and experts‘ opinion (see for example, Allen & Yen,
1979; Thorndike, 2005). However, in many cases, states determine where they would like the
6
For a detailed description of cut scores, see Perie (Chapter 7, this volume).
Considerations for an AA-MAS
Page 274
tests to be comparable. For example, the AA-MAS may be a better measure of achievement at
the lower end of the scale, but states may want the ―proficient‖ level to be comparable across
AA-MAS and general assessments.
Under content and construct, issues concerning achievement standards and proficiency
judgments are discussed. The psychometric comparability section provides a discussion of the
classical measurement approach in examining comparability of assessments. Under this
section, a discussion of reliability and validity as well as scale and score comparability is
presented. This section also presents a discussion on structural equation modeling and
differential item functioning (DIF) approaches in examining comparability. Under linguistic
comparability, a description of grammatical complexity, lexical density, and text length is
presented. The comparability in the basic text features includes a discussion of comparability in
terms of format, tables, charts, graphs, and pictures. The depth of knowledge section provides a
description of a theoretical underpinning of depth of knowledge in the context of comparability
and suggests ways to compare the level of depth of knowledge across the two assessments
(AA-MAS and the general state assessments). Finally, this chapter reviews comparability in
terms of accommodations used for students with disabilities based on their IEPs.
Content and Construct Comparability between AA-MAS and General Assessments
The first and most important criterion for examining comparability of different
assessments is to establish content and construct comparability. Assessments that measure
different content and constructs may not produce comparable outcomes even if they are shown
to be comparable in terms of psychometric characteristics. The concept of content and construct
comparability has been discussed from different points of view, including expert judgment,
moderation, and alignment with the grade-level content standards. Thus, comparability between
AA-MAS and general assessments can be established through expert judgment, moderation by
Considerations for an AA-MAS
Page 275
inspection, social moderation (Winter, 2009), and alignment with the grade-level content
standards.
The concept of cognitive demand in assessment is related to the discussion of content
and construct comparability. The level of cognitive demand of an assessment (or of an item
within an assessment) could be determined by different sources some of which are relevant to
the assessments and some of them may be due to the impact of nuisance variables or construct
irrelevant sources. For example, irrelevant or poorly labeled visuals may increase the cognitive
load of perceiving information for students with disabilities. However, cognitive complexity might
be a relevant factor in the assessment. Similarly, in reading comprehension, items that are
inferential may significantly increase the cognitive load for students with disabilities and, thus,
affect students‘ ability to display their understanding of the passage. Assessing depth-ofknowledge reveals the level of the cognitive demands of the standards and the cognitive
demands of the assessment items. Level 4 of depth of knowledge (extended thinking) requires
the highest level of cognitive demand in Webb‘s model. This level demands complex reasoning,
planning, developing, and thinking (Webb, et al., 2006).
Expert judgment. Content comparability can be established through experts‘ judgment
(Mislevy, 1992). A team of experts, including content specialists, teachers, and linguistic
experts, could judge the comparability of content across the two assessments. For expert
judgment a rubric is often developed and validated to help ensure more consistent judgment
across a variety of experts with different backgrounds. To estimate interrater reliability,
comparability between the two assessments may be examined by more than one person.
Interrater reliability indices such as kappa and intra-class correlations can then be computed
and can be compared across the two assessments.
Moderation. Moderation refers to the identification of local scoring instances that are
overly stringent or overly lenient to ―moderate‖ those scores to bring them more into line (Burton
& Linn, 1994). ―Moderation‖ techniques can be grouped into several categories. A commonly
Considerations for an AA-MAS
Page 276
used approach is classified as moderation by inspection or cross-moderation, which is mainly
based on judgmental audits. Another moderation approach is based on statistical moderation.
Under this approach, moderation is done based on external criteria. The third approach is the
enhancement of one of the two approaches mentioned above or a combination of the two
approaches (Burton & Linn, 1994; Linn, 1993; Mislevy, 1992).
Alignment with the grade-level content standards. States conduct alignment studies to
demonstrate how and to what extent their assessments are aligned with their content standards
(see, for example, Moore & O‘Neil, 2004). Alignment is conducted to examine the degree of
correspondence between a set of educational standards—often referred to as state content
standards—and the assessments that are developed to measure what students are expected to
learn in relation to those standards (Moore, & O‘Neil, 2004; Webb, 1999, 2002). According to
Webb, there are several major criteria for alignment. These criteria include: a) categorical
concurrence, b) depth-of-knowledge consistency7, c) range-of-knowledge correspondence, and
d) balance of representation (Webb, 1999). Studies suggest that Webb‘s alignment model, used
for the alignment of assessment content with the state content standards for regular state
assessments, can be meaningfully applied to alternate assessments, which provide states a
way to comply with the requirements of IDEA and NCLB (Roach & Elliott, 2004; Gong & Marion,
2006; Tindal, 2005). Tindal (2005) describes procedures for alignment of alternate assessments
using the Webb alignment model.
In fact, the report on the peer review results from six states suggests that test blueprints
should provide evidence on the alignment between the AA-MAS and grade-level content
standards (Filbin, 2008; Kettler & Almond, 2009). These assessments are required to assess
the same breadth and depth as the general assessments.
7
A more detailed discussion of the depth of knowledge alignment will be presented later in this chapter.
Considerations for an AA-MAS
Page 277
Psychometric Comparability between AA-MAS and General Assessments
Psychometric comparability data can serve as complementary and supportive evidence
to the content and construct comparability. In this section, psychometric comparability is
discussed in the context of both classical and modern theory of measurement.
Classical Measurement Approach in Examining Comparability of Assessments. Under
the classical test theory, assessment outcomes can be considered comparable if they are from
parallel or tau-equivalent tests. To consider different forms of assessments as parallel or tauequivalent, certain assumptions underlying parallel and tau-equivalent tests must be met. The
main assumption underling classical test theory is that the measurement error is randomly
distributed and that the correlation between measurement errors of two tests is zero (
E1E2
= 0).
This implies that the correlation between the true scores of form A of the test with measurement
error of form B of the test is zero (
T1E2
= 0). Additionally, if two tests have observed score of X
and X‘ that satisfies the assumption of randomly distributed measurement error, and if, for every
population of examinees, the true score of test 1(T) equals the true score of test 2 (T‘), and if the
variance of measurement error of test 1 (
(
2
E‘),
2
E)
equals the variance of measurement error of test 2
then the tests are considered parallel tests (Allen & Yen, 1979; Thorndike, 2005).
However, as indicated by the U.S. Department of Education (2007) and in the literature,
AA-MAS assessments differ from states‘ general assessments in many different aspects. Some
of these assessments include fewer items with higher p-values (less difficult items), have
shorter and fewer reading passages, have less complex linguistic structures, and use fewer
distractors in their multiple choice items (Cortiella, 2007; Kettler & Almond, 2009; Lazarus, et al.,
2007). Such systematic differences between states‘ alternate and general assessments create
major limitations on the comparability of the two assessments.
One question is whether a shorter version of the test can be considered as parallel (tauequivalent) to the full version of the test. As indicated above, a test with fewer items, given all
Considerations for an AA-MAS
Page 278
other parallel test assumptions are true (except for an additive constant, C12), can be considered
as a tau-equivalent test to the original version. However, in terms of alternate assessments, it is
very difficult to assume that the two tests (the state‘s general assessment and the alternate
assessment) meet any conditions of parallel tests. If the shorter version of the test is different
than the full version of the test in terms of linguistic structure, item difficulty, or the number of
choices (in multiple-choice format), then the shorter version of the test cannot be considered as
a tau-equivalent test. For example, Karvonen and Huynh (2007) indicated that the alternate
assessment items typically require simple cognitive processes such as recall.
Reliability, Validity, and Standard Error of Measurement. Assessments used by states
for accountability purposes are usually developed and field- tested for mainstream students. In
the development process, many of the assessment needs of subgroups (e.g., students with
disabilities) may not be adequately considered. Therefore, there may be many sources of
nuisance variables that can impact the performance of students with disabilities. These sources,
which are also referred to as extraneous variables (Linn & Gronlund, 1995), contaminants, or
construct-irrelevant (Haladyna & Downing, 2004; Messick, 1984), may differentially impact the
reliability and validity of assessments for students with disabilities. Linn and Gronlund (1995)
indicated that ―During the development of an assessment, an attempt is made to rule out
extraneous factors that might distort the meaning of the scores, and follow-up studies are
conducted to verify the success of these attempts‖ (p. 71). Further, Zieky (1988) cautions that a
fairness review to identify construct-irrelevant sources is a major effort when constructing
impartial tests. Welch and Dunbar (Chapter 6, this volume) address some of the issues
concerning the development of AA-MAS by first discussing the best practice in test development
and then highlighting the advantages and disadvantages of various options for modifications.
Reliability. The linguistic complexity of assessment and format and structure of test
booklets (e.g. font size, complex and irrelevant charts and graphs, crowded text on pages) may
cause fatigue and frustration for students with disabilities and may result in a higher level of
Considerations for an AA-MAS
Page 279
measurement error that can substantially reduce the reliability of assessment outcomes for
these students. For example, Abedi, Leon and Mirocha (2003) found a gap of over .32 in the
internal consistency coefficient in scores of state assessments in math between students with
disabilities and students without disabilities. The standard error of measurement was
substantially larger in assessment outcomes for students with disabilities.
More importantly, some of these sources of construct-irrelevant variance may bring
another dimension to the measurement model and make it multi-dimensional. This multidimensionality issue would then introduce more complexity into the comparability concept. For
example, it would be a challenging task, both in terms of content and psychometric properties,
to compare assessment outcomes that are unidimensional in nature (i.e. measuring only the
construct relevant aspects of assessment) with the outcomes that represent several dimensions
or constructs (construct-irrelevant). Multidimensionality of assessment outcomes may directly
impact internal consistency measures (such as alpha coefficient), as these measures are
extremely sensitive to multidimensionality and severely underestimate reliability of
multidimensional assessments when they are supposed to measure a single construct (Cortina,
1993).
Validity. Sources of construct-irrelevant variance discussed above will not only impact
the reliability of assessments, but they also directly affect the construct validity of the
assessments. Content-based state tests are designed to measure constructs that are the target
of the assessments. Therefore, items within a test are often highly correlated when they are
used for students without disabilities for whom the assessments were constructed. For students
with disabilities, however, different sources of construct irrelevant variance may negatively
impact the validity of these assessments. More importantly, it might be difficult to assess the
validity of AA-MAS using external criteria since finding valid external criteria for examining the
validity of AA-MAS can be a major challenge.
Considerations for an AA-MAS
Page 280
As part of a comprehensive set of studies on score comparability, DePascale (2009)
examined the comparability of an AA-MAS (which he called a 2% test). The study addressed
validity questions regarding the modified test by examining the relationship between the states‘
regular test and the modified test. The goals of the study were: ―a) to determine that the 2%
tests were less difficult than the general tests, and b) to determine that the 2% tests provide
more reliable information than the general test in the area of interest for its target population of
the 2% test‖ (p. 11). As one of the major findings of the study, the author indicated that the 2%
test provided reliable information at the extreme low end of the scale.
The findings of this study are very informative in terms of psychometric properties of the
AA-MAS as compared with those for the regular state assessments. While it is true that the
alternate assessments may generally have lower reliability and validity when considering the
entire distribution of content knowledge, these assessments do what they set out to accomplish
for the lower part of the ability distribution (for a comprehensive presentation of validity of AAMAS see Marion, Chapter 9, this volume).
Structural Equation Modeling Approach. Comparing the structural relationship between
test items, item scores and total test score and between different subscales of the tests across
the two assessments (AA-MAS and the general state assessment) using a multiple group
confirmatory factor analytic model can provide useful information (Abedi, 2002). Figure 8-1
presents a multiple group confirmatory model that provides comparability evidence. This model
includes data from states with two groups of students: 1) students with disabilities taking AAMAS and, 2) students without disabilities taking general assessments, which can be used in
content areas such as math, science, or reading/language arts. A set of item parcels can be
constructed based on existing data from each group. Each parcel should include items
representing different subscales but items across parcels should be similar. These item parcels
can then be used for creating a latent variable for the content-based assessment. A set of
performance assessment scores can then be used as external criteria for establishing criterion-
Considerations for an AA-MAS
Page 281
based validity. This set of variables could include student GPA, teacher‘s rating of student
performance, and a score from the portfolio in the content being assessed. Thus, the model
includes two latent variables: one is the test scores, which is computed from the item parcel,
and the other is the external criterion.
Figure 8-1. Multi-group Confirmatory Model
Item Parcel 1
Item Parcel 2
Item Parcel 3
Item Parcel 4
AA-MAS
or
Regular
Assessment
External
Criteria*
GPA
Teacher Rating
Portfolio
Assessment
Others
*External criteria include assessment outcomes other than state test scores.
In this model, it is not necessary to have an equal number of items across the two
assessments; however, the number of parcels across the two groups should be equal. A set of
invariance across the two groups of students taking the two different assessments can be tested
for significance. These include testing invariance of factor loadings of the item parcels with the
overall test latent variables and the overall external criteria and invariance of correlations
between the content-assessment‘s latent variable with the external criteria latent variable. A
Considerations for an AA-MAS
Page 282
significant outcome on the invariance hypotheses would be an indication of a lack of
comparability between the two assessments.
Generalizability approach. A G model can be used to examine comparability between
AA-MAS and the regular state assessments. A multi-facet G design can be used to compare the
two groups in terms of different sources of measurement error such as variation between items
and occasions. The model can be applied separately to each of the two groups. Sources of
variability due to items and occasions (and interaction between items and occasions) can be
compared across the two groups of subjects taking the two different assessments. The overall
G coefficients as well as the percent of variance explained by each of the sources (e.g.
subjects, items, occasions, and interaction between items and occasion) can be compared
across the two groups. Both relative and absolute decisions for computing G coefficient may be
applied and comparisons can be made. For a more detailed discussion of the generalizability
concept and instruction on how to conduct a G study see Brennan (2001) and Shavelson &
Webb (1991).
The structure and size of the variance components, the significance of the main and
interaction effects, and effect sizes across the two assessments can also be compared for any
significant differences. For example, if a linguistic complexity facet accounts for 25% of the
variance in one assessment but explains less than 5% in the other assessment, then such a
difference points to a lack of comparability in terms of the generalizability model.
DIF Approach. Test publishers and states often conduct differential item functioning
(DIF) analyses to identify test items that differentially perform across subgroups of students.
Different student background variables are used for grouping students. DIF analyses are usually
conducted to examine any possible biases due to gender, ethnicity, students‘ socio-economic
status, students‘ disability status, and students‘ language background. DIF analyses by
students‘ disability status using states‘ regular and alternate assessments may shed light on
comparability between the two assessments. For example, it would be informative to compare
Considerations for an AA-MAS
Page 283
the trend of results of DIF on a state general assessment in a specific content (e.g., math or
science) with the results of DIF on a corresponding AA-MAS assessment by some student
background variables such as SES (free/reduced price lunch program) for similarities and
differences. One could identify the number of items labeled as ―A‖, ―B‖, or ―C‖ DIF across AAMAS and the general assessments. Similarly, comparing the pattern of uniform and non-uniform
DIF across AA-MAS and general assessments could provide useful information. However, there
are major limitations in such comparisons when used for comparability purposes.
First, the number of students (sample size) for the AA-MAS is a major challenge since it
would be extremely difficult to find a large enough sample to compare with the focal and
reference groups. Second, the literature clearly suggests that different procedures for computing
DIF may provide quite different outcomes (Abedi, Leon & Kao, 2008). This may be a major
problem in using DIF as a criterion for judging comparability of different assessments since
different approaches may perform differently across the assessments. Third, and most
importantly, test items may perform differentially across students in different subgroups of
disabilities. Results of a study on DIF by different subgroups of disabilities found that a
substantial number of items were identified as DIF for different disability groups but very few or
almost none of the items were identified as DIF across all or several subgroups of disabilities
(Abedi, Leon, & Kao, 2008).
Scale and Score Comparability between AA-MAS and General Assessments
According to the Standards for Educational and Psychological Testing (AERA, APA, &
NCME, 1999), scale and score comparability between two assessments is required in order to
provide similar interpretation of the outcomes measured by those assessments. Winter (2009)
discusses score comparability and indicates that ―In general, test scores can be considered
comparable if they can be used interchangeably‖ (page 6). The author argues, however, that
comparability depends on the level of scores being used. For example, scores reported at the
Considerations for an AA-MAS
Page 284
scale level or achievement score level can be compared only at that level. Additionally these
scores must provide measures of the same set of knowledge and skills, present the same
degree of achievement, and have similar technical properties (Winter, 2009).
AA-MAS tests may have major differences from the general assessments, including the
number and level of difficulty of test items. Even with such differences some evidences of score
comparability can be obtained. ―For example, it may be desirable to interpret scores from a
shortened (and hence less reliable) form of a test by first converting them to corresponding
scores on the full-length version‖ (American Educational Research Association, American
Psychological Association, & National Council on Measurement in Education, 1999, page 52).
Converting scores from the AA-MAS and general assessments on the same scale is extremely
challenging since such conversions require comparability on many different aspects. As Mislevy
(1992) indicated, ―No simple statistical machinery can transform the results of two arbitrarily
selected assessments so that they provide interchangeable information about all questions
about students‘ competencies‖ (p. 91), particularly in the case of AA-MAS and the general
assessments, where there are such major and substantial differences. However, despite the
limitations, such conversions could provide useful information.
Linguistic Comparability between AA-MAS and General State Assessments
Recent literature on the issues concerning assessments consistently demonstrates the
impact of language factors on the assessment outcomes. These factors differentially impact the
performance of subgroups of students such as English language learners and students with
disabilities, particularly those with learning and reading disabilities. Several linguistic features
have been identified in the literature that may have major impacts on the assessment outcomes
for these students (Abedi, 2006; Abedi, 2007 [LEP Partnership] Abedi & Lord, 2001; Sato, 2007
[LEP Partnership]). Research literature also suggests that reducing the level of unnecessary
linguistic complexity of assessments helps to close the gap between subgroups, such as
Considerations for an AA-MAS
Page 285
students with disabilities and ELL students, and the main group (Abedi, 2006; Abedi, Leon &
Mirocha, 2003). The process of reducing the level of unnecessary linguistic complexity of
assessments is often referred to as the linguistic modification of assessments. While the
linguistic modification process does not affect the performance of native speakers of English at
the higher performance level (thus, not affecting the validity of assessments), it helps reduce the
performance gap between students with disabilities and students without disabilities.
One approach in examining comparability between AA-MAS and general state
assessments is to compare their linguistic structures to see if the level of linguistic complexity is
similar across the two assessments. In this comparison, a distinction must be made between
linguistic features that are related and those that are not related to the content being measured.
To promote comparability with the states‘ general assessments, the linguistic structure related
to the content may not be changed since changing linguistic structures that are content related
may alter the construct being measured. Therefore, linguistic modification should only be
applied to language that is not related to the construct being measured. The distinction between
related and unrelated linguistic features to the content can be made by a team of experts that
includes content and linguistic experts.
Literature provides clear guidelines and instructions on how to conduct linguistic
modification of assessments and how to distinguish between necessary and unnecessary
linguistic complexity in the assessments (Abedi, 2006, 2007; Sato, 2007). These guidelines
would help in two important contexts: 1) to identify which linguistic features should be
considered in making judgments on comparability between state general and alternate
assessments, and 2) to inform the development of alternate assessments when linguistic
modification is considered as a factor in the alternate assessment process.
Once again, it is extremely important to distinguish between linguistic structure that is
related to the content being measured and unnecessary linguistic complexity that is unrelated to
the content. As indicated earlier, some states may choose to remove or reduce unnecessary
Considerations for an AA-MAS
Page 286
linguistic complexity of the AA-MAS to make them more accessible for students with disabilities,
which should be considered a reasonable practice. Reducing unnecessary linguistic complexity
of assessments makes them also more accessible for students with no disability who are at the
lower level of achievement performance distribution.
Assessing the Level of Linguistic Complexity of the AA-MAS and General Assessment Test
Items
Outcomes of the studies on the impact of linguistic factors on the assessment of English
language learners and students with disabilities have led to identification of 48 linguistic features
that make assessment more complex for these students (see, for example, Abedi, 2006; Abedi
& Lord, 2001). The first step in examining linguistic comparability is to identify which of the
linguistic features are present in the item and the seriousness of their effects. A rating system
for evaluating the level of linguistic complexity of test items was developed. The rating system
consists of two different rating scales: (1) an analytical scale, and (2) a holistic scale. Test items
in the AA-MAS and general assessments may be rated on both scales, and then the ratings can
be compared across the AA-MAS and the general assessment. We will elaborate on each of
these rating approaches below:
Analytical Rating. Figure 8-2 presents a rubric for rating the level of complexity on each
of the 14 features for each test item. The ratings are based on a 5-point Likert scale, with ―1‖
indicating no complexity present with respect for that particular feature and ―5‖ suggesting a
high level of linguistic complexity with that feature. Abedi and colleagues combined the 48
linguistic features mentioned above into 14 general categories for ease of rating linguistic
complexities (Abedi & Lord, 2001). Ratings were performed on the overall 14 categories. Each
test item receives 14 ratings, one for each linguistic feature. For example, with respect to
linguistic feature number 1 ―Word frequency/familiarity‖, if the words used in the item are ―very
familiar‘ and ―frequently‖ being used, then the item receives a rating of ―1‖, ―no complexity‖.
However, if the word is unfamiliar, or being used less frequently, then depending on the level of
Considerations for an AA-MAS
Page 287
unfamiliarity and low frequency, it receives ratings between 2 to 5. Judgments on the
familiarity/frequency of the word can be made based on sources such as The American
Heritage Word Frequency Book (Carroll et al., 1971) and the Frequency Analysis of English
Usage: Lexicon and Grammar (Francis & Kucera, 1982). The highest rating of 5 in this example
would refer to a word that is extremely unfamiliar and rarely occurring.
Figure 8-2. Rubric for Rating Level of Linguistic Complexity
Degree of Complexity
Not
Complex
Linguistic Feature
1
2
3
4
1. Word frequency/ familiarity
2. Word length
3. Sentence length
4. Passive voice constructs
5. Long noun phrases
6. Long question phrases
7. Comparative structures
8. Prepositional phrases
9. Sentence and discourse
structure
10.Subordinate clauses
11. Conditional clauses
12. Relative clauses
13. Concrete vs. abstract or
impersonal presentations
14. Negation
Most
Complex
5
Holistic Rating. Similar to the ratings that are assigned based on the analytical
procedure, this rating is on a 5-point Likert scale, ―1‖ representing items with no or minimal level
of linguistic complexity and ―5‖ showing an item with an extremely complex linguistic structure.
Figure 8-3 shows the Holistic Rating Rubric. As Figure 8-3 shows, a test item free of linguistic
complexity (with a rating of ―1‖) does not suffer from any of the 14 linguistic complexity threats.
For example, the item uses familiar or frequently used words, the words as well as sentences in
these items are generally shorter, there are no complex conditional and/or adverbial clauses,
Considerations for an AA-MAS
Page 288
and there are no passive voices or abstract presentations. On the contrary, an item with a
severe level of linguistic complexity contains all or many sources of threats.
Figure 8-3. Holistic Item Rating Rubric
QUALITY
1
EXEMPLARY ITEM
Sample Features:
Familiar or frequently used words; word length generally shorter
Short sentences and limited prepositional phrases
Concrete item and a narrative structure
No complex conditional or adverbial clauses
No passive voice or abstract or impersonal presentations
2
ADEQUATE ITEM
Sample Features:
Familiar or frequently used words; short to moderate word length
Moderate sentence length with a few prepositional phrases
Concrete item
No subordinate, conditional, or adverbial clauses
No passive voice or abstract or impersonal presentations
3
WEAK ITEM
Sample Features:
Relatively unfamiliar or seldom used words
Long sentence(s)
Abstract concept(s)
Complex sentence/conditional tense/adverbial clause
A few passive voice or abstract or impersonal presentations
4
ATTENTION ITEM
Sample Features:
Unfamiliar or seldom used words
Long or complex sentence
Abstract item
Difficult subordinate, conditional, or adverbial clause
Passive voice/ abstract or impersonal presentations
5
PROBLEMATIC ITEM
Sample Features:
Highly unfamiliar or seldom used words
Very Long or complex sentence
Abstract item
Very difficult subordinate, conditional, or adverbial clause
Many passive voice and abstract or impersonal presentations
Ratings on the linguistic modification (both analytical and holistic) provide diagnostic
information on the linguistic barriers present in test items. This information may help item writers
or test developers to identify problem items. These items can then be corrected for such
problems. Since linguistic modification ratings are on a Likert-scale, median ratings can be
Considerations for an AA-MAS
Page 289
computed and can be used for decisions on how the items should be modified. Different
patterns of linguistic complexity across the two assessments may lead to the conclusion that the
two assessments are not linguistically comparable (for a detailed description of linguistic
complexity assessment, see Abedi, 2006).
Basic Text Features
Text Format and Text Features- This feature includes typeface and point size, passage
and item placement on page(s), and the relevance and clarity of all visuals within a passage. It
is important to consider typeface and point size when determining if a test passage and its items
are accessible to students with low visibility. Similarly, pages with excessive blank space, or
conversely, with small margins, may unfairly affect students with low visibility. Additionally, it is
essential to determine if the visuals (graphs, tables, charts, and pictures) within a passage are
relevant, meaning that they are needed to answer the item, and if they are clearly labeled.
Visuals that are not relevant or clearly labeled may increase the cognitive load of perceiving
information for students with disabilities.
Type of Passage/Item. This feature identifies the genre of the passage (descriptive,
narrative, expository, poetry, or persuasive). This feature also determines whether a test item is
informational or inferential. Items that are informational can be answered using only slightly
paraphrased or verbatim information that is found in the passage, whereas items that are
inferential require the student to combine information from the text together with their own
background knowledge in order to recognize implicit relationships and outcomes. Therefore,
items that are inferential may significantly increase the cognitive load for students with
disabilities and, thus, hinder students‘ ability to accurately display their understanding of the
passage.
Considerations for an AA-MAS
Page 290
Comparability in terms of Depth of Knowledge (DOK)
Depth-of-knowledge (DOK) comparability is confirmed if what is elicited from students on
the assessments is as cognitively demanding as what students are expected to know and do as
stated in the state and national standards. DOK consistency is defined as the level of
consistency between the cognitive demands of standards and the cognitive demands of the
assessment items. If between 40% and 50% of the assessment items are at or above the DOK
levels of the objectives, then the DOK consistency criterion is ―weakly‖ met. Webb (1999)
defines four levels of cognitive complexity when comparing the cognitive demands of the
standards and assessment items. They are: Level 1 (Recall), Level 2 (Skills and Concepts),
Level 3 (Strategic Thinking), and Level 4 (Extended Thinking).
Level 1 (Recall): Level 1 items require students to use simple skills or abilities. Examples
include recall of information. Key words that signify Level 1 include identify, recall, measure, and
recognize.
Level 2 (Skill/Concept): Level 2 items demand a higher level of cognitive complexity
compared to Level 1 items. Assessment items at Level 2 require some decision making on how
to approach problems or activities. For example, Level 2 keywords for math include terms such
as classify, estimate, compare, and organize. These actions imply more than one step.
Level 3 (Strategic Thinking): Assessment items at this level require reasoning, planning,
and using evidence, which are at a higher level of thinking than the previous two levels. In most
instances, at this level students are required to explain their thinking. The cognitive demands at
Level 3 are complex and abstract. An activity, however, that has more than one possible answer
and requires students to justify the response they give would most likely be at Level 3.
Level 4 (Extended Thinking): Level 4 items require the highest level of cognitive demand
in Webb‘s model of depth of knowledge. This level demands complex reasoning, planning,
developing, and thinking. Assessment items at Level 4 may include activities such as designing
Considerations for an AA-MAS
Page 291
and conducting experiments, analyzing and interpreting results, combining and synthesizing
ideas into new concepts, and critiquing experimental designs (Webb, et al., 2006).
Comparability Issues in the Accommodated Assessments for Students with Disabilities
Many different forms of accommodations are being used for students with different types
of disabilities. However, some of these accommodations may alter the construct being
measured; therefore, the issues concerning comparability of accommodated and nonaccommodated assessments are of paramount importance to this chapter since many of the
features that are incorporated in AA-MAS are being used as a form of accommodation for
students with disabilities.
The most commonly used accommodations for students with disabilities are: using
Braille, using computerized assessments, dictation of response to a scribe, extended time,
interpreter for instructions, marking answers in test booklets, reading aloud test items, reading
or simplifying test directions, and providing test breaks (Thurlow, et al., 2000). We present a
summary of studies that have examined the validity of assessments under these
accommodations. As can be seen from these summaries, research evidence suggests that
some of these accommodations alter the construct being measured. For others, however, there
is not much evidence to judge the validity of assessments using those accommodations. Issues
concerning validity of accommodations are directly related to comparability of accommodated
and non-accommodated assessments. When an accommodation is not valid, i.e., when it alters
the construct being measured, then the outcomes of assessments under this accommodation
are not comparable with assessments conducted under standard conditions with no
accommodations provided.
Braille is used for students with blindness or significant visual impairments. Developing a
Braille version of the test may be more difficult for some items than others. It would be
challenging to use Braille items with diagrams and special symbols (Bennett, Rock, & Kaplan,
Considerations for an AA-MAS
Page 292
1987; Bennett, Rock, & Novatkoski, 1989; Coleman, 1990). Thurlow & Bolt (2001) recommend
using Braille for students with severe visual impairments. Also, Braille is recommended to be
paired with extended time (Thurlow & Bolt, 2001).
Computerized assessment can be used for students with physical impairments who
have difficulty in responding to items in paper-and-pencil format. Some studies suggest that this
accommodation is effective in increasing the performance of students with disabilities (see for
example, Russell & Haney, 1997; Russell, 1999; Russell & Plati, 2001). Other studies did not
find computerized assessment to be effective (MacArthur & Graham, 1987) or even as effective
as traditional assessments (Hollenbeck, Thurlow & Bolt, 2001; Tindal, Stieber & Harniss, 1999;
Watkins & Kush, 1988; Varnhagan & Gerber, 1984).
Extended time is one of the most commonly used and most controversial forms of
accommodation for students with different types of disabilities (SWD). Some studies found that
extended time affects the performance of both SWD and non-SWD students and, therefore,
makes the validity of this accommodation suspect. Similarly, Thurlow et al., (2000) expressed
concern on the validity of this accommodation. Chiu and Pearson (1999) found extended time to
be an effective accommodation for students with disabilities, particularly for those with learning
disabilities. Some studies found extended time to help students with disabilities in math (Chiu &
Pearson, 1999; Gallina, 1989). However, other studies did not show an effect of extended time
on students with disabilities (Fuchs et al., 2000; Marquart, 2000; Munger & Loyd, 1991). Studies
on the effect of extended time on language arts did not find this accommodation to be effective
(Fuchs et al., 2000; Munger & Loyd, 1991). Thus, research on this particular accommodation
produced inconsistent results. More studies are needed to make a firm recommendation
regarding the use of this accommodation.
The interpreter for instructions accommodation is recommended for students with
hearing impairments. Ray (1982) found that adaptations in the directions help deaf children
score the same as other students (see also Sullivan, 1982). Thurlow & Bolt (2001) recommend
Considerations for an AA-MAS
Page 293
that using an interpreter for instructions may be beneficial to students with hearing impairments.
However, not much information exists on the validity of this accommodation.
The marking answers in test booklet accommodation can be used for students with
difficulties in mobility coordination. Some studies on the effectiveness of this accommodation did
not find a significant difference between those tested under this accommodation and those
using separate answer sheets (Rogers, 1983; Tindal, Heath, Hollenbeck, Almond, & Harniss,
1998). However, other studies found lower performance for students using this accommodation
(Mick, 1989). Since a majority of studies did not show a performance increase as a result of this
accommodation, it would be safe to say that this may not have much of an impact on the
construct being measured.
A read aloud test is used for students with learning disabilities and students with physical
or visual impairments. While some studies found this accommodation to be valid in math
assessments (Tindal et al., 1998), there have been concerns over the use of this
accommodation in reading and listening comprehension tests (see for example, Burns, 1998;
Phillips, 1994), because the construct being measured may be changed and, thus, the validity of
the assessment is affected (see also, Bielinski, Thurlow, Ysseldyke, Freidebach, & Freidebach,
2001; Meloy, Deville, & Frisbie, 2000).
The reading or simplifying test directions accommodation is appropriate for students with
reading/ learning disabilities. A study by Elliot, Kratochwill & McKevitt (2001) suggests that this
accommodation affects the performance of both students with disabilities (63.4%) and students
without disabilities (42.9%), thus expressing concerns over the validity of this accommodation.
Test breaks can help students with different forms of disabilities. In a study, DiCerbo,
Stanley, Roberts, & Blanchard (2001) found that students receiving test breaks obtained scores
significantly higher than those under standard testing conditions, and that middle and low ability
readers benefited more from this accommodation than high ability readers. However, another
study (Walz, Albus, Thompson, & Thurlow, 2000) found that students with disabilities did not
Considerations for an AA-MAS
Page 294
benefit from a multiple-day test administration, while students without disabilities did benefit.
These results show quite the opposite of what is expected of valid accommodations.
The summary of research presented above on some of the commonly used
accommodations shows a lack of consensus regarding validity and comparability of
accommodated assessments as compared with the general assessments with no
accommodations provided. As indicated earlier in this chapter, when accommodations alter the
construct being measured, the accommodated assessment outcomes are not comparable with
the non-accommodated assessments.
The issues of comparability of accommodated and non-accommodated assessments are
relevant to our discussion of comparability between AA-MAS and regular state assessments for
two main reasons. First, with AA-MAS some students with disabilities still need
accommodations that are recommended by their IEP team. Therefore, knowledge of
comparability of accommodated and non-accommodated AA-MAS would help states to provide
comparability evidence to peer reviewers and other authorities. Second, the concept of
comparability is well defined on accommodation studies. While there may not be sufficient
literature on the comparability of all accommodations used by states, we have enough research
to help us design comprehensive studies for examining comparability between various
assessments.
Recommendations to NYSED for Establishing Comparability between AA-MAS and
General Assessments
Comparability of the outcome of AA-MAS with the general state assessments is one the
most fundamental aspects of the assessment and accountability system for students with
disabilities who are eligible for taking AA-MAS . The comparability issues may seriously impact
many aspects of the academic careers of these students (often referred as the 2% group),
including their instruction, promotion, and graduation. Literature presents the comparability
argument mostly in terms of psychometric and content comparability (see for example, AERA,
Considerations for an AA-MAS
Page 295
APA, NCME Standards, 1999; DePascale, 2009). While content and psychometric comparability
provides convincing evidence on the comparability of AA-MAS with the general state
assessments, looking at a more comprehensive picture on comparability will provide states with
the information they need to present a strong justification for development and use of AA-MAS.
In this chapter we presented different criteria for judging and examining the
comparability between AA-MAS and state general assessments. These criteria included
comparability with respect to content and construct, alignment with the state content standards,
classical measurement concept of comparability, psychometrics, linguistics, depth of knowledge
and accommodations. Such discussions and guidelines could help the State of New York in
developing and validating AA-MAS assessments in different content areas. Useful information
and guidelines are also provided by researchers and practitioners for developing AA-MAS tests.
For example, in a report by the Council of Chief State School Officers (CCSSO, 2007),
guidelines are provided on strategies for states to prepare for and respond to peer reviewers.
Similarly, researchers provided recommendations on the use of cognitive interviews in the
design and development of AA-MAS (Almond, et al., in press). Literature also provides a
summary of research on item and test alteration focusing on AA-MAS along with guidelines on
the nature and implementation of these alterations (Kettler & Almond, 2009). While many test
publishers and states provided comparability evidence on a few of these aspects, this chapter
may help New York test developers to provide a comprehensive plan for comparability of any
future AA-MAS development.
In general, to respond to the mandate of inclusion of students with disabilities, states
must be able to present evidence that alternate assessment outcomes are comparable with the
outcomes of general assessments. Lack of comparability between the alternate and general
assessments jeopardizes the academic career of students with disabilities in many different
ways, including the promotion and graduation of these students.
Considerations for an AA-MAS
Page 296
While the proposed criteria for examining comparability in this chapter apply to different
content areas, the application of some of these criteria may be slightly different across different
content areas. For example, the linguistic comparability concept may apply differently to content
areas in which language is the target of measurement (e.g., reading/language arts) in areas
where unnecessary linguistic complexity may be considered as construct irrelevant sources
(e.g., math and science).
A major application of the comparability discussion may be on the type of credential that
is most appropriate for students in the State of New York (Regents, Local diploma, IEP
certificate). Some options are available for students with disabilities taking AA-MAS. If the
comparability between AA-MAS and the regular state assessments can be established, then it
would be reasonable to recommend credentials for students with disabilities (particularly for
those eligible for the AA-MAS) that are similar to those recommended for non-disabled students.
For example, students can receive a Local diploma if they follow the same academic program
as the Regents diploma but at a lower cut score on the exam. Would a lower cut score make the
assessment outcome less aligned with the state content standards for a passing grade or
graduation? Currently, the IEP certificate is primarily for those students who have significant
cognitive disabilities and are taking the AA-AAS. For students with disabilities who are taking
AA-MAS, the more reasonable option is the local diploma. A report by NCEO (Wiener, 2006)
presents one alternative way to meet diploma requirements using an AA-GLAS rather than an
AA-MAS.
Guidelines for Examining Comparability of AA-MAS: How Much Comparability is
Necessary?
In this chapter many different approaches to comparability of AA-MAS with the general
state assessments were discussed. It would be extremely challenging, if not impossible, to
establish comparability between the two assessments in all areas discussed in this chapter.
Therefore, the main question is in which areas and to what extent evidence is needed to
Considerations for an AA-MAS
Page 297
suggest the AA-MAS outcome is comparable with the general assessment outcome. To answer
this question, we define comparability in two levels. Level 1 includes comparability features that
are necessary and are required in order to assume comparability across the two measures, and
level 2 includes features that are desired but not absolutely necessary in establishing
comparability between the two assessments.
Necessary Features for Establishing Comparability between AA-MAS and General
Assessments
The decision on which comparability features are absolutely necessary and which are
desired may be more speculative as there is not enough research evidence on which to base a
decision. Therefore, based on existing literature and based on the author‘s own professional
judgment, the following features are deemed to be necessary as the minimum requirement to
establish comparability between AA-MAS and regular state assessments:
1. Content and Construct Comparability. This feature is one of the most important aspects
of comparability. This level of comparability can be established by applying a
combination of different approaches such as experts‘ review and alignment to the state
content standards. In conducting expert reviews, New York State Education Department
(NYSED) may form a team of experts in the targeted content area to judge the level of
comparability of AA-MAS with the regular state assessment. The team should include
experts in the area of assessment and accommodation of students with disabilities
focusing on the 2% population, content area experts, and test item writers. NYSED could
develop and validate a rubric for assessing comparability. The rubric validation process
should include focus groups and cognitive labs to assure clarity of instruction for rating
comparability. The rubric may use a 5-point Likert-Scale for rating comparability. NYSED
may then decide on the level of exact or within point agreement between AA-MAS and
general assessment ratings.
Considerations for an AA-MAS
Page 298
2. Scale and Score Comparability. As recommended by the Joint Standards (AERA, APA,
& NCME, 1999, page 52) score comparability can be roughly achieved by converting
scores from the AA-MAS and general assessments on a same scale. While such
conversion is extremely challenging it could provide useful information.
3. Depth of Knowledge Comparability. One could expect a fair level of comparability
between two assessments measuring the same content if they measure the same level
of depth of knowledge. As in the case of content and construct comparability discussed
above, NYSED can form a team of experts to judge the level of depth of knowledge
across the two assessments. Following Webb‘s methodology (Webb, 1999), ratings of
the depth of knowledge can be provided and compared. A cut point on the level of
consistency between ratings of the depth of knowledge of the two assessments can then
be used to judge the comparability.
4. Accommodation Comparability. Accommodated assessment outcomes could be invalid if
the accommodations alter the construct. Many students with disabilities require
accommodations based on their IEPs. However, research on the validity of
accommodations for students with disabilities is very limited. NYSED may compare
accommodations used under the two assessment conditions and provide evidence
based on the literature that there are no validity concerns that could differentially affect
validity of accommodated assessments under the two testing conditions.
Desired Features for Establishing Comparability between AA-MAS and General Assessments
1. Psychometric Comparability. A comparison between the overall psychometric properties
of the two assessments may shed light on the comparability issues. It would be
informative to compare the reliability and validity coefficients of the two assessments.
For example, a comparison between the internal consistency coefficients (Cronbach
alpha) between the two assessments can be done. Similarly, criterion-related validity
Considerations for an AA-MAS
Page 299
coefficients of the two assessments can be compared. For example, the structural
relationships between parcels of items and the total test as well as the relationships
between test scores and external criteria can be compared using a multiple group
confirmatory factor analyses as elaborated in Figure 8-1. Examining a set of invariance
between the structural relationships of the two assessments may shed light on the
comparability of the two assessments.
2. Comparability between Linguistic Structure of the Two Assessments. Different features
of linguistic complexities that may impact the validity of assessments were introduced
earlier in this chapter. A comparison between analytical ratings (Figure 8-2) and holistic
ratings (Figure 8-3) would provide supporting evidence on the comparability of the
assessments.
3. Comparability between the Two Assessments on Basic Text Features. Comparability
between the basic features of the two assessments may provide additional evidence of
comparability. It would be helpful if the text features such as the presentation of the
assessments (e.g., computer versus paper-and-pencil), formatting, fonts, tables and
charts, and pagination of the two assessments are similar. For example, two
assessments may not be highly comparable if one uses complex tables and charts or
crowded pages and the other uses simple tables and charts with a large point size and
less crowded pages.
Considerations for an AA-MAS
Page 300
References
Abedi, J. (2002). Standardized achievement tests and English language learners:
Psychometrics issues. Educational Assessment, 8(3), 231-257.
Abedi, J., Leon, S. and Kao, J. (2008). Examining Differential Item Functioning in Reading
Assessments for Students with Disabilities. Los Angeles: University of California, Center
for the Study of Evaluation/National Center for Research on Evaluation, Standards, and
Student Testing.
Abedi, J., Leon, S., & Mirocha, J. (2003). Impact of student language background on contentbased performance: Analyses of extant data (CSE Tech. Rep. No. 603). Los Angeles:
University of California, National Center for Research on Evaluation, Standards, and
Student Testing.
Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in
Education, 14(3), 219-234.
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA:
Brooks/Cole.
Almond, P. J., Cameto, R., Johnstone, C. J., Laitusis, C., Lazarus, S., Nagle, K., Parker, C. E.,
Roach, A. T., & Sato, E. (in press). White paper: Cognitive interview methods in reading
test and item design and development for alternate assessments based on modified
academic achievement standards (AA-MAS). Dover, NH: Measured Progress and Menlo
Park, CA: SRI International.
Bennett, R.E., Rock, D.A., & Jirele, T. (1987). GRE score level, test completion, and reliability
for visually impaired, physically handicapped, and nonhandicapped groups. The Journal
of Special Education, 21 (3), 9-21.
Bennett, R.E., Rock, D.A., & Kaplan, B.A. (1987). SAT differential item performance for nine
handicapped groups. Journal of Educational Measurement, 24 (1), 44-55.
Bennett, R.E., Rock, D.A., & Novatkoski, I. (1989). Differential item functioning on the SAT-M
Braille Edition. Journal of Educational Measurement, 26 (1), 67-79.
Bielinski, J., Thurlow, M., Ysseldyke, J., Freidebach, J., & Freidebach, M. (2001). Read-aloud
accommodation: Effects on multiple-choice reading & math items (Technical Report 31).
Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Brannon Brennan, R. L. (2001). Generalizability theory.New York: Springer-Verlag.
Burns, E. (1998) Test accommodations for students with disabilities. Springfield: Charles C.
Thomas, Publisher, LTD.
Burton and Linn, 1994. E. Burton and R.L. Linn , Comparability across assessments: Lessons
from the use of moderation procedures in England. In: CSE Technical Report 369,
National Center for Research on Evaluation, Standards, and Student Testing (1994).
Considerations for an AA-MAS
Page 301
Chiu, C. W. T., & Pearson, P. D. (1999). Synthesizing the effects of test accommodations for
special education and limited English proficiency students. Paper presented at the
National Conference on Large Scale Assessment.
Coleman, P.J. (1990). Exploring visually handicapped children‘s understanding of length (math
concepts). (Doctoral dissertation, The Florida State University, 1990). Dissertation
Abstracts International, 51, 0071.
Cortiella, C. (2007). Learning opportunities for your child through alternate assessments:
Alternate assessments based on modified academic achievement standards.
Minneapolis, MN: University of Minnesota, National center on Educational Outcomes.
Cortina, J. M (1993). What is coefficient alpha? An examination of theory and application.
Journal of Applied Psychology, 78, 98-104.
DePascale, C. (2009). Modified tests for modified achievement standards: Examining the
comparability of a 2% test. Dover, NH: National Center for the Improvement of
Educational Assessment.
DiCerbo, K., Stanley, E., Roberts, M., & Blanchard, J. (April, 2001). Attention and standardized
reading test performance: Implications for accommodation. Paper presented at the
annual meeting of the National Association of School Psychologists, Washington, DC,
2001.
Eckhout, T., Larsen, A., Plake, B., & Smith, D. (2007). Aligning a state‘s alternative standards to
regular core content standards in reading and mathematics: A case study. Applied
Measurement in Education 20(1), 79-100.
Elliott, S., Kratochwill, T., & McKevitt, B. (2001). Experimental analysis of the effects of testing
accommodations on the scores of students with and without disabilities. Journal of
School Psychology, 31(1), 3-24.
Elliott, S. N., & Roach, A. T. (2007). Alternate assessments of students with significant
disabilities: Alternative approaches, common technical challenges. Applied
Measurement in Education, 20, 301-333.
Elosua, P., & Lopez-Jauregui, A. (2008). Equating between Linguistically Different Tests:
Consequences for Assessment. Journal of Experimental Education, 76(4), 387-402.
Filbin, J. (2008). Lessons from the initial peer review of alternate assessments based on
modified achievement standards. Paper developed for the U.S. Department of
Education, Office of Elementary and Secondary Education.
Francis, W. N., & Kucera, H. (1982). Frequency analysis of English usage: Lexicon and
grammar. Boston: Houghton Mifflin.
Fuchs, L.S., Fuchs, D., Eaton, S.B., Hamlett, C., & Karns, K. (2000). Supplementing teacher
judgments about test accommodations with objective data sources. School Psychology
Review, 29 (1), 65-85.
Considerations for an AA-MAS
Page 302
Fuchs, L.S., Fuchs, D., Eaton, S.B., Hamlett, C., Binkley, E., & Crouch, R. (2000). Using
objective data source to enhance teacher judgments about test accommodations.
Exceptional Children, 67 (1), 67-81.
Gallina, N.B. (1989). Tourette‘s syndrome children: Significant achievement and social behavior
variables (Tourette‘s syndrome, attention deficit hyperactivity disorder) (Doctoral
dissertation, City University of New York, 1989). Dissertation Abstracts International, 50,
0046.
Gong, B. (1999). Relationship between student performance on the MCAS (Massachusetts
Comprehensive Assessment System) and Other Tests. National Center for the
Improvement of Educational Assessment, Inc.
Gong, B. (2007). Considerations in designing a “2% Assessment” (AA-MAS): A beginning
framework and examples of conceptual possibilities. Paper presented at the Special
Education Partnership Conference on Alternate Assessments Based on Modified
Academic Achievement Standards. Washington, DC July 26, 2007.
Gong, R., & Blank, R. (2002). Designing school accountability systems: Towards a framework
and process. Washington, DC: The Council of Chief State School Officers.
Gong, B. and Marion, S. (2006). Dealing with flexibility in assessment for students with
significant cognitive disabilities. National Center for the Improvement of Educational
Assessment, Inc.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing.
Educational Measurement: Issues and Practice, 23(1), 17-27.
Hollenbeck, K., Tindal, G., Stieber, S., & Harniss, M. (1999). Handwritten vs. word processed
statewide compositions: Do judges rate them differently? Eugene, OR: University of
Oregon, BRT.
Karvonen, M., & Huynh, H. (2007). Relationship between IEP characteristics and test scores on
alternate assessment for students with significant cognitive disabilities. Applied
Measurement in Education, 20(3), 273-300.
Kettler, R., & Almond, P. (2009). Improving reading measurement for alternate assessment:
Suggestions for designing research on item and test alterations.
Lazarus, S. S., Rogers, C., Cormier, D., & Thurlow, M. L. (2008). States’ participation guidelines
for alternate assessments based on modified academic achievement standards (AAMAS) in 2008 (Synthesis Report 71). Minneapolis, MN: University of Minnesota,
National Center on Educational Outcomes.
Lazarus, S.S., Thurlow, M. L., Christensen, L.L., & Cormier, D. (2007). States‘ alternate
assessments based on modified achievement standards (AA-MAS) in 2007 (Synthesis
Report 67). Minneapolis, MN: University of Minnesota, National Center on Educational
Outcomes.
Linn, R. (1993). Linking results of distinct assessments. Applied Measurement in Education,
6(1), 83-102.
Considerations for an AA-MAS
Page 303
Linn, R. L., & Gronlund, N. E. (1995). Measurement and Assessment in Teaching. (7th ed.). NJ:
Merril.
Lowrey, K.A., Drasgow, E., Renzaglia, A., & Chezan, L. (2009). Impact of alternate assessment
on curricula for students with severe disabilities. Assessment for Effective Intervention,
32(4), 244-253.
MacArthur, C.A., & Graham, S. (1987). Learning disabled students‘ composing under three
methods of text production: Handwriting, word processing, and dictation. The Journal of
Special Education, 21 (3), 22-42.
Marion, S. (2006, October 10). Introduction to Comparability. Presented at the Seminar on
Inclusive Assessment in Denver, CO.
Marquart, A. (2000). The use of extended time as an accommodation on a standardized
mathematics test: An investigation of effects on scores and perceived consequences for
students of various skill levels. Paper presented at the annual meeting of the Council of
Chief State School Officers, Snowbird, UT.
Meloy, L.L., Deville, C., & Frisbie, C. (2000). The Effect of a Reading Accommodation on
Standardized Test Scores of Learning Disabled and Non Learning Disabled Students.
Paper presented at the annual meeting of the National Council on Measurement in
Education (New Orleans, LA).
Messick, S. (1984). The psychology of educational measurement. Journal of Educational
Measurement, 21, 215-237.
Mick, L.B. (1989). Measurement effects of modifications in minimum competency test formats
for exceptional students. Measurement and Evaluation in Counseling and Development,
22, 31-36.
Mislevy, R. (1992). Linking educational assessments: Concepts, issues, methods, and
prospects. Princeton, NJ: ETS Policy Information Center.
Moore, A.D. & O‘Neal, S. (2004). A study of the alignment between the New Mexico K-12
Content Standards, Benchmarks, and Performance Standards and the draft state
assessment. (Unpublished research study.) Santa Fe, NM: New Mexico State
Department of Education.
Munger, G.F., & Loyd, B.H. (1991). Effect of speededness on test performance of handicapped
and nonhandicapped examinees. Journal of Educational Research, 85 (1), 53-57.
Olson, B., Mead, R., & Payne, D. (2002). A report of the standard setting method for alternate
assessments for students with significant disabilities (Synthesis Report 47). Minneapolis,
MN: University of Minnesota, National Center on Educational Outcomes.
Perez, J.V. (1980). Procedural adaptations and format modifications in minimum competency
testing of learning disabled students: A clinical investigation (Doctoral dissertation,
University of South Florida, 1980). Dissertation Abstracts International, 41, 0206.
Considerations for an AA-MAS
Page 304
Phillips, S.E. (1994). High stakes testing accommodations: Validity vs. disabled rights. Applied
Measurement in Education, 7 (2), 93-120.
Rabinowitz, S., & Schroeder, C. (2006). Creating aligned standards and assessment systems.
Washington, DC: The Council of Chief State School Officers.
Ray, S.R. (1982). Adapting the WISC-R for deaf children. Diagnostique, 7, 147-157.
Roach, A. T. (2005). Alternate Assessment as the ―Ultimate Accommodation‖: Four Challenges
for Policy and Practice. Assessment for Effective Intervention, 31, (1), 73-78.
Roach, A. T., & Elliott, S.N. (2004). Alignment analysis and standard setting procedures for
alternate assessments. WCER Working Papers, No. 2004-1. Available at:
http://www.wcer.wisc.edu/publications/workingpaper/abstract/Working_Paper_No_2004
_1.asp
Rogers, W.T. (1983). Use of separate answer sheets with hearing impaired and deaf school
age students. B.C. Journal of Special Education, 7 (1), 63-72.
Russell, M. (1999). Testing writing on computers: A follow-up study comparing performance on
computer and on paper. Educational Policy Analysis Archives, 7.
Russell, M., & Haney, W. (1997). Testing writing on computers: An experiment comparing
student performance on tests conducted via computer and via paper-and-pencil.
Educational Policy Analysis Archives, 5(3).
Russell, M., & Plati, T. (2001). Effects of computer versus paper administration of a statemandated writing assessment. TCRecord.org. Retrieved January 23, 2001, from the
World Wide Web: http://www.tcrecord.org/PrintContent.asp?ContentID=10709.
Sato, E. (2007). A Guide to Linguistic Modification: Increasing English Language Learner
Access to Academic Content: Washington, DC: The U.S. Department of Education—
LEP Partnership.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability, theory:Aprimer.Newbury Park, CA:
Sage Publication.
Subkoviak, J.J. (1988). A practitioner‘s guide to computation and interpretation of reliability
indices for mastery test. Journal of Educational Measurement, 25(1), 47-55.
Sullivan, P.M. (1982). Administration modifications on the WISC-R Performance Scale with
different categories of deaf children. American Annals of the Deaf, 127 (6), 780-788.
Thompson, S., Lazarus, S., Clapper, A., & Thurlow, M. (2006). Adequate yearly progress of
students with disabilities: Competencies for teachers. Teacher Education and Special
Education, 29 (2), 137-147.
Thorndike, R. M. (2005). Measurement and Evaluation in Psychology and Education. New
Jersey, Pearson, Merrill.
Considerations for an AA-MAS
Page 305
Thurlow, M. & Bolt, S. (2001). Empirical support for accommodations most often allowed in
state policy. Minnesota: National Center for Educational Outcome. NCEO Synthesis
Report #41.
Thurlow, M., House, A., Boys, C., Scott, D., & Ysseldyke, J. (2000). State participation and
accommodation policies for students with disabilities: 1999 Update (Synthesis Report
33). Minneapolis, MN: University of Minnesota, National Center on Educational
Outcomes.
Tindal, G. (2005). Alignment of alternate assessments using the Webb system. Washington,
DC. Council of Chief State School Officers.
Tindal, G. Heath, B., Hollenbeck, K., Almond, P., & Harniss, M. (1998). Accommodating
students with disabilities on large-scale tests: An empirical study of student response
and test administration demands. Exceptional Children, 64 (4), 439-450.
U.S. Department of Education (2007). Modified Academic Achievement Standards, NonRegulatory Guidance. Washington, DC.
Varnhagen, S., & Gerber, M.M. (1984). Use of microcomputers for spelling assessment:
Reasons to be cautious. Learning Disability Quarterly, 7, 266-270.
Walz, L., Albus, D., Thompson, S., & Thurlow, M. (2000). Effect of a multiple day test
accommodation on the performance of special education students (Minnesota Report
34). Minneapolis: University of Minnesota, National Center on Educational Outcomes.
Watkins, M.W., & Kush, J.C. (1988). Assessment of academic skills of learning disabled
students with classroom microcomputers. School Psychology Review, 17 (1), 81-88.
Webb, N. L. (1999). Alignment of science and mathematics standards and assessments in four
states (NISE Research Monograh No. 18). Madison: University of Wisconsin-Madison,
National Institute for Science Education. Washington, DC: Council of Chief State School
Officers.
Webb, N.L. (2002). An analysis of the alignment between mathematics standard and
assessments for three states. Paper presented at the American Educational Research
Association meeting in New Orleans, LA, April 1-5, 2002.
Webb, N.L., Horton, M., & O‘Neal, S. (1999). An analysis of the alignment between language
arts standards and assessments for four states. Paper presented at the American
Educational Research Association meeting in New Orleans, LA, April 1-5, 2002. 30
Welch and Dunbar, S. (this volume). Developing items and assembling test form for the
alternate assessment based on modified achievement standards (AA-MAS). In NYCC
White Paper on the Alternate Assessment Based on Modified Achievement Standards
(AA-MAS). New York: New York State Education Department.
Wiener, D. (2006). Alternate assessments measured against grade-level achievement
standards: The Massachusetts “competency portfolio” (Synthesis Report 59).
Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Considerations for an AA-MAS
Page 306
Winter, P. C. (2009). Comparing apples to apples: Challenges and approaches to establishing
the comparability of test variation. Paper presented at the annual meeting of the
National Council of Measurement in Education. San Diego, California.
Wright, N., & Wendler, C. (1994). Establishing timing limits for the new SAT for students with
disabilities. Paper presented at the Annual Meeting of the National Council on
Measurement in Education (New Orleans, LA, April 4-8, 1994).
Considerations for an AA-MAS
Page 307
CHAPTER 9
CONSTRUCTING A VALIDITY ARGUMENT FOR ALTERNATE ASSESSMENTS BASED
ON MODIFIED ACHIEVEMENT STANDARDS (AA-MAS)
Scott Marion
States are facing complex issues as they have begun developing alternate assessments
based on modified achievement standards (AA-MAS). While several researchers have been
working to improve the validity evaluations of state assessments in recent years, with a more
intense focus on alternate assessments based on alternate achievement standards (AA-AAS;
Elliott, Compton, & Roach, 2007; Marion & Pellegrino, 2006; Rabinowitz & Sato, 2005; Shafer,
2005), these challenges are just beginning to be addressed for AA-MAS. The AA-MAS requires
a more careful validity evaluation than one might undertake for either the AA-AAS or the general
assessment. This is due, in part, to the uncertain conceptual framework supporting this
assessment initiative as well as the novelty of the enterprise. This does not downplay the need
for validity work on the general and other alternate assessments; rather the lack of conceptual
grounding in the case of the AA-MAS requires a thorough validity evaluation. This evaluation
should provide the state with information about how to improve the program or even to help the
state determine if the AA-MAS is ―worth it.‖ That is, do the benefits (instructional, assessment,
accountability, and social justice) outweigh the costs, including negative unintended
consequences, of implementing an AA-MAS?
Many writers of technical reports for general assessments nominally align their analyses
and results with the Standards for Educational and Psychological Testing (American
Educational Research Association [AERA], American Psychological Association [APA], National
Council on Measurement in Education [NCME], 1999), particularly when there are student or
school stakes requiring that the inferences drawn from the assessment be valid, reliable, and
fair (AERA, APA, & NCME, 1999). This is an obvious and important first step, but one that is
often not fully met. Leading measurement theorists (e.g., Cronbach, Messick), including the
Considerations for an AA-MAS
Page 308
authors of the 1985 and 1999 Standards (AERA, APA, & NCME, 1985, 1999) are clear that
validity is the most important technical criterion for educational assessment. Validity is defined
as the ―degree to which evidence and theory support the interpretations of the test scores
entailed by proposed uses of the test‖ (AERA, APA, & NCME, 1999, p.9). In other words, test
scores convey interpretations and inferences that must be verified by both empirical evidence
and a logical argument.
The challenge, however, has moved from having states and test contractors conduct
research/evaluation studies to investigate particular aspects of testing programs to designing
systematic validity plans for evaluating the efficacy of comprehensive validity arguments. This
approach requires synthesizing the various empirical results against a theory of action and
validity argument (Kane, 2006). This chapter, drawing heavily on Kane (2006), outlines a
framework for constructing and evaluating a validity argument for a state‘s alternate assessment
on modified achievement standards (AA-AAS), by first briefly describing Kane‘s argumentbased approach to validation in general and as applied to alternate assessment specifically and
then presenting strategies for organizing and prioritizing validity evaluations. The last part of the
chapter summarizes the types of evidence one might collect as part of such an evaluation.
Examples are presented throughout the chapter to make some of these ideas more concrete.
Framework
The proposed validity evaluation is based on a unified conception of validity centered on
the inferences related to the construct including significant attention to the social consequences
of the assessment (Cronbach, 1971, Messick, 1989, Shepard, 1993). Kane‘s (2006) argumentbased approach serves as the focus because it offers several pragmatic advantages over
evaluations based in the construct model, primarily in terms of prioritizing studies and
synthesizing the results of the various studies. At its simplest, Kane‘s approach asks the
evaluator to search for and evaluate all the threats to the validity of the assessment inferences.
Considerations for an AA-MAS
Page 309
If these threats are not substantiated, the inferences drawn from the assessment results may be
supported, at least tentatively. Unfortunately, ―tentatively‖ is the best that can be accomplished
with these sorts of falsification-based endeavors. The term validity evaluation is used to
encompass the interpretative and validity arguments (discussed below), the plan for conducting
various validity studies, the studies themselves, and the evaluation of the results.
Why an Argument?
Kane‘s (2006) argument-based framework ―…assumes that the proposed interpretations
and uses will be explicitly stated as an argument, or network of inferences and supporting
assumptions, leading from observations to the conclusions and decisions. ―Validation involves
an appraisal of the coherence of this argument and of the plausibility of its inferences and
assumptions‖ (p. 17). A validity argument serves to organize studies, provides a framework for
analysis and synthesis, and forces critical evaluation of claims using a falsification orientation.
For example, part of a validity argument for an AA-MAS should relate to the claim that the
modified assessment is measuring ―grade-level‖ knowledge and skills. The content-related
evidence then should include information that would allow one to challenge this grade-level
claim if, in fact, the test was measuring below grade-level content. An argument-based
approach requires the user, developer, and/or evaluator to search for reasons why the intended
inferences are NOT supported. Obviously, in practice one cannot search for ALL reasons, so
there is a need to prioritize studies. There are several approaches for prioritizing the studies, but
using the theory of action and classes of evidence, both discussed later, offer useful frames for
thinking about how to prioritize the considerable number of potential interesting studies.
Kane’s Argument-Based Framework
Kane proposed using two types of arguments: an interpretative argument and a validity
argument. According to Kane (2006), ―an interpretative argument specifies the proposed
interpretations and uses of test results by laying out the network of inferences and assumptions
Considerations for an AA-MAS
Page 310
leading to the observed performances to the conclusions and decisions based on the
performances, [while] the validity argument provides an evaluation of the interpretative
argument‖ (p 17). In other words, the interpretative argument outlines what the user/evaluator
thinks should occur (and why it should occur) as a result of the testing and related systemic
endeavors, while the validity argument is essentially the conclusions drawn after weighing the
available evidence and logic. A major advantage of Kane‘s approach is that it provides a more
pragmatic approach to validation than the construct model. Explicitly specifying the proposed
interpretations and uses of the assessment (system), developing a measurement procedure
consistent with these proposed uses, and then critically evaluating the plausibility of the initial
assumptions and resulting inferences is somewhat more straightforward than evaluating the
validity of an assessment under a construct model. This does not mean that construct validity is
not the focus of the validity evaluation. Kane‘s approach simply provides a different orientation
and more pragmatic approach for evaluating the validity of the score inferences than under a
strict construct model. The construct model is based on more of a research approach where
one is searching for causal connections, whereas Kane‘s argument-based approach works from
an evaluation perspective where one is trying to determine whether a program is operating as
intended with minimal unintended consequences.
Kane (2006) pushes for the development of the interpretative argument in the
assessment design phase. The notion of specifying purposes and uses up front and then
designing an assessment to fit these intentions is certainly not a new idea. However, designing
a fully coherent system built on a sound theoretical model of learning and use has been
receiving more attention in the last decade, in part as a result of the publication of Knowing
What Students Know (Pellegrino, Chudowsky, & Glaser, 2001; see also Pellegrino, Chapter 4,
this volume). Unfortunately most assessments do not start from an explicit attention to validity in
the design phase so many current-day evaluators working with states are put in the position of
having to retrofit a validity argument to the existing system. However, in the case of the AA-
Considerations for an AA-MAS
Page 311
MAS, there is no excuse—since the work is so new—for not starting the validity work at the
beginning of the design phase. For example, Pellegrino (Chapter 4, this volume) provides an
extensive set of examples showing how understanding the ways in which students develop
competence in the domain should guide assessment development.
The Interpretative Argument
The interpretative argument is essentially a mini-theory as it provides a framework for
interpretation and use of test scores. Like theory, the interpretative argument guides the data
collection and methods for conducting the validity analyses. Most importantly, theories are
falsifiable and making the connection between the interpretative argument and ―mini-theory‖ is
intended to emphasize that validation is not a confirmationist exercise. It is helpful to think of the
interpretative argument as a series of ―if-then‖ statements, such as, if the student is
appropriately selected to participate in the AA-MAS, then the observed score will more
accurately reflect the student’s grade level knowledge and skills.
Kane (2006) noted two stages of the interpretative argument. The development stage
focuses on the development of measurement tools and procedures as well as the
corresponding interpretative argument. Kane (2006) suggested that it is appropriate to have a
confirmationist bias (a stance that favors evidence and interpretations supporting the current
state of the assessment system) in this stage since the developers (state personnel and
contractors) are trying to make the program as good as possible. During the appraisal stage
Kane argues that there should be more of a focus on critical evaluation of the interpretative
argument. This should be a more neutral and ―arms-length‖ standpoint to provide a more
convincing evaluation of the proposed interpretations and uses. However, given the uncertain
conceptual foundations of the AA-MAS, it will be important to temper Kane‘s allowance of a
confirmationist bias during any stage and consider adopting a more critical stance throughout
the validity evaluation.
Considerations for an AA-MAS
Page 312
One of the most effective challenges to interpretative arguments (or scientific theories, in
general) is to propose and substantiate an alternative argument that is equally or more plausible
than the proposed proposition (or hypothesis in terms of scientific theory). With AA-MAS, users
must seriously consider and challenge themselves with competing alternative explanations for
test scores. For example, one might want to propose (and confirm) that increases in students
scoring at the proficient level on the AA-MAS who were not proficient previously on the general
assessment reflects the fact that the modifications made on the AA-MAS allowed the student to
better show what they know on the same constructs. However, the evaluator must consider
plausible alternative hypotheses such as increases in students scoring at the proficient level on
the AA-MAS who were not proficient previously on the general assessment might be due to
developing an easier test so students answered more items correctly but on a reduced range of
constructs and difficulty.
Bringing this back to a more simple and pragmatic level, test validation is the process of
offering assertions (propositions) about a test or a testing program and then collecting data and
posing logical arguments to refute those assertions. Using the assertion and alternate
hypothesis in the example above, the evaluator should design studies that evaluate the rigor of
the test using some form of cognitive interview to judge whether student responses reflect
differences in demonstrated knowledge and skills when comparing the general and modified
assessments. The evaluator would then analyze these data in light of both the original and
alternative hypotheses. In essence, validity evaluators are continually trying to challenge the
supportability of the claims put forth about the testing program.
Values and Consequences
Kane and others suggest that the evaluator must attend to values and consequences
when evaluating a decision procedure such as when a testing program is used as a policy
instrument as is the case with essentially all state tests. When conducting such a validity
Considerations for an AA-MAS
Page 313
evaluation, the values inherent in the testing program must be made explicit and the
consequences of the decisions as a result of test scores must be evaluated.
There might be a lingering theoretical debate about whether consequences are integral
to construct validity, but most leading validity theorists (e.g., Cronbach, 1971; Lane & Stone,
2002; Linn, Baker, & Dunbar, 1991; Messick, 1989, 1995; Shepard, 1997) have argued
convincingly that consequences are as much a part of validity as is content or any other source
of evidence. However, whether or not one agrees with this view of validity, alternate
assessments are used for important policy decisions and the consequences of these decisions
must be considered in validity evaluations. This is especially true when evaluating the validity of
an AA-MAS where stakeholders and evaluators must be particularly attentive to unintended
negative consequences that may arise from lower expectations or other potential
denied/reduced opportunities for grade-level instruction.
Guiding Philosophy, Purposes, and Uses
It has become axiomatic to say that the validity of an assessment (actually the
inferences from the assessment scores) can be judged only in the context of specified purposes
and uses. Further, the guiding philosophy must be considered when evaluating the validity of
the AA-MAS. The term ‗guiding philosophy‖ is used here in the same way that Quenemoen
(Chapter 2, this volume) used it earlier. It is meant to describe a particular orientation, set of
assumptions, and beliefs about a particular program or policy. For example, if state leaders
believe that students eligible for the AA-MAS can score at a level comparable to proficient on
the general assessment except that their disability interacts with their chances to show what
they know and/or they have not yet been well instructed, then that would lead to certain types of
assessment designs and validity arguments. On the other hand, if the leaders believe that
eligible students would have little chance, even if well instructed, to score at a level comparable
to the proficient score on the general assessment, then that would lead to quite a different
Considerations for an AA-MAS
Page 314
assessment design. The discussions in this white paper are much more aligned with the first
example than the second, but the point here is that state leaders need to be explicit and honest
about the philosophy behind their decision to develop an AA-MAS.
A state‘s guiding philosophy should help explain what the state envisions for the
relationship among the AA-AAS, AA-MAS, and the general assessment. Most states, as well as
the USED regulations, place the AA-MAS closer to the general assessment than the AA-AAS,
because both are designed to measure grade level standards, but some state policymakers
apparently see the AA-MAS as a true intermediary between the AA-AAS and the general
assessment. Again, it is important for the state to explicitly articulate these connections.
The purposes should be conceptually coherent with the state‘s guiding philosophy. For
example, if the state is interested in developing the AA-MAS so that the targeted students can
―better show what they know‖, it would lead to one type of argument and theory of action.
Whereas, if the state implemented an AA-MAS in order to better align the assessment with the
current learning opportunities and beliefs about how eligible students learn, it would lead to
another type of validity evaluation. More perversely, some states could be implementing an AAMAS to ease accountability pressures on schools associated with the performance of students
with disabilities. However, it is doubtful that such states will be explicit about these sorts of
goals.
Uses follow, in terms of the validity argument, from the state‘s guiding philosophy and
purposes. In New York‘s case, the results of the AA-MAS will be used to determine students‘
achievement levels for the accountability system, particularly for AYP determinations. The
Board of Regents and the New York State Education Department will have to decide whether
and how the results of the AA-MAS will be used for graduation determinations, particularly in
terms of eligibility for a Regent‘s diploma. However, NYSED would like these assessments to
have some instructional value as well. These potential uses — discussed in considerable detail
in the next chapter (see Domaleski, Chapter 10, this volume) — have significant implications for
Considerations for an AA-MAS
Page 315
the evaluation of the validity of the AA-MAS. If the scores from the AA-MAS are to be treated as
comparable for the purposes of a Regent‘s diploma, then certain types of comparability studies
should be incorporated in the validity evaluation (see Abedi, Chapter 8, this volume for an
extensive treatment of comparability). On the other hand, if participation in the AA-MAS shuts
off the opportunity for a Regent‘s diploma, an evaluator should consider certain types of studies
to examine the unintended negative consequences.
A Theory of Action: The Starting Point for an Interpretative Argument
Katherine Ryan (2002) and others have suggested that having state leaders (or other
assessment stakeholders) lay out a more general ―theory of action‖ can be a useful starting
point for developing a more complete interpretative argument. This theory of action is really a
simplified interpretative argument that requires the explication of the intended components of an
assessment and decision system as well as the mechanisms by which a test user could
reasonably expect to get from one step to the next. Developing a theory of action for any
validation, evaluation, or test development activity is a useful exercise. Given the field‘s lack of
clarity around the AA-MAS, a well developed theory of action is perhaps even more critical than
it might be for other validation initiatives. Policymakers, developers, stakeholders, and
technicians should have to very explicitly lay out why they think that implementing an AA-MAS
will lead to improved educational opportunities for eligible students. In addition to the ―why‖, they
should have to describe the ―how‖ or the mechanisms by which they think that these improved
learning opportunities will occur. For example, one might postulate that AA-MAS scores will be
more accurate depictions of what eligible students know than general assessment scores so
that teachers will be able to provide more appropriate learning opportunities for these students.
The evaluator and/or user must specify the mechanism by which these score reports will lead to
the anticipated changes in teaching practices, such as targeted instruction and/or more
appropriate curricular materials.
Considerations for an AA-MAS
Page 316
Based on two example guiding philosophies presented in Chapter 2 (Quenemoen, this
volume), two example theories of actions for a modified assessment system were created to
illustrate how these differences could play out as different validity arguments. These examples
were purposefully created to represent two quite different guiding philosophies and approaches
to the AA-MAS.
Example # 1: The AA-MAS allows eligible students to show that what they know may be
comparable to similar performance levels on the general assessment.
1. Academic content standards are the same as for the general assessment, and the test
blueprint for the AA-MAS is essentially the same as that for the general assessment, but
contains some modifications (e.g., fewer passages) to make adjustments for students‘
disabilities and includes slightly less difficult items than on the general assessment.
2. The achievement standards incorporate recognition of students‘ disabilities (e.g., need
for supports) and while they signal high expectations for eligible students and their
teachers, they are slightly lower than the general assessment achievement standards.
3. The assessment is designed to measure grade-level content and high achievement
expectations, accurately allowing students to show what they know as well as what they
do not know and are able to do.
4. Teachers provide instruction that is aligned with these high academic expectations and
ensure that students get the supports necessary allowing them to succeed with gradelevel content.
5. The test and achievement descriptors signal and reinforce appropriate instructional and
formative assessment strategies for use in classrooms/schools.
6. Student scores on the AA-MAS provide a more accurate estimate of what eligible
students know and can do compared with the general assessment.
7. Student performance on the test is used by teachers and school leaders to help them
figure out how to provide more appropriate supports and programs.
Considerations for an AA-MAS
Page 317
8. Improved student/school performance on the AA-MAS leads to higher accountability
scores.
Example #2: The AA-MAS will better align with current learning opportunities and beliefs about
how eligible special education students learn grade-level academic content.
1. Academic content standards are the same as for the general assessment, but the test
blueprint for the AA-MAS focuses on fewer and generally easier items tailored to the
lower expectations held for these students. The blueprint and test specifications also
contain some modifications (e.g., fewer passages) to make adjustments for students‘
disabilities.
2. The achievement standards incorporate references to students‘ disabilities (e.g., need
for supports) and are designed to describe eligible students‘ knowledge and skills
relative to their current learning opportunities.
3. Teachers provide instruction that is designed to take students from where they are and
then helps the students make progress in this curriculum even if it is below grade level.
4. The assessment is designed to provide measurement information about where students
are performing, relative to grade-level content, to better show what they know and are
able to do.
5. The test and achievement descriptors signal the appropriate levels and types of
instructional and formative assessment strategies for use in classrooms/schools.
6. Student performance on the test is used by teachers and school leaders to support
(validate) current supports and programs.
7. The AA-MAS scores provide information about students‘ current performance to the
student, parents, and teachers.
8. A test more aligned to students‘ instructional levels leads to more proficient students with
disabilities and higher accountability scores.
Considerations for an AA-MAS
Page 318
9. Students, in part because of these lower expectations, do not make progress on gradelevel standards relative to their same grade peers and certain opportunities are shut off
from these students by virtue of these missed (or denied) opportunities (e.g., a Regent‘s
Diploma).
Each aspect of the theory of action leads to claims or propositions that are the basis of
the interpretative argument. For example, a proposition such as, ―students of teachers using
formative assessment strategies aligned with the AA-MAS targets have higher scores than
students of teachers using formative assessments not matched with the AA-MAS targets, could
be specified from the general claim found in the first example theory of action presented in
Figure 9-1, ―the AA-MAS reinforces appropriate instructional and formative assessment
strategies for use in classrooms/schools.‖ An interpretative argument will start with one or more
of the goals and guiding philosophy discussed above and then trace the claims of the AA-MAS
that results in meeting that goal. Specifying a theory of action is a useful first step in creating a
more complete interpretative argument. Sample theories of action were developed in the form of
pictures shown in Figures 9-1 and 9-2. However, a theory of action, particular when laid out
graphically as in the examples here, is of limited utility. It is necessarily quite broad — perhaps
superficial—and therefore on its own, cannot guide a comprehensive validity evaluation.
Evaluators must ―zoom in‖ on specific components and linkages within the theory of action in
order to explicate the propositions/assertions that form the basis of the interpretative argument.
Examples of such propositions are presented below in the evidence section. Further, when test
users (e.g., states) and developers create theories of action, there is often little emphasis on
negative, unintended consequences. Example #2 above was created to illustrate the importance
of searching for and trying to uncover negative, unintended consequences, but evaluators
should adopt this stance for any interpretative argument and validity evaluation plan.
Considerations for an AA-MAS
Page 319
Figure 9-1. Example #1 Theory of Action
Same academic
content standards
as the general
assessment
The AA-MAS reinforce
appropriate instructional and
formative assessment
strategies for use in
classrooms/schools
Test performance is
used by educators to
provide more
appropriate supports
and programs
Essentially same
blueprint as general
assessment with certain
modifications
The assessment is
designed to measure
grade-level content and
high achievement
expectations accurately
Teachers provide
instruction aligned with
academic
expectations
Student scores on
the AA-MAS
provide an
accurate estimate
of what students
know
Achievement standards
include recognition of specific
supports, but focus on high
expectations
Considerations for an AA-MAS
Page 320
Improved
student/school
performance on the
AA-MAS leads to
higher accountability
scores
Figure 9-2. Example #2 Theory of Action
Same academic
content standards
as the general
assessment
Test blueprint for the
AA-MAS focuses on
fewer and generally
easier items than
general test
The assessment is designed to
provide information of gradelevel content at students‘ current
levels to better show what they
know now
The AA-MAS signals the
appropriate levels of
instructional and formative
assessment strategies for
use in classrooms/schools
Test performance is
used by educators to
support current
approaches and
programs
This AA-MAS leads to
more proficient scores
(compared to example
#1) and higher
accountability scores
Teachers provide
instruction to help the
students make progress in
the current curriculum
even if it is below grade
level
Students do not make progress
on grade-level standards relative
to their same grade peers and
are denied certain opportunities
Achievement standards
include recognition of
specific supports, but
focus on current learning
opportunities
Considerations for an AA-MAS
Page 321
Both examples separate out the various claims by the stage of the assessment or
accountability process. Both of these theories of action start with the purposes of the
assessment, move to content and achievement standards, and then to assessment
development (e.g., test blueprint), and end with claims about uses and consequences of the
scores. The end result is the goal of increasing student achievement or at least test/
accountability scores. An interim goal is to provide information to the teachers to help them
improve how they structure learning opportunities for these students. Importantly, these two
theories of action lead to different social justice claims, which have implications for the collection
and evaluation of consequential evidence.
Prioritizing the Validity Evaluation Questions
The interpretative arguments and the more general theories of action lead to many
possible evaluation questions — almost always more than can be addressed in a validity
evaluation constrained by time and/or resources. The prioritization should be influenced by the
particular guiding philosophy. Following Kane (2006), the state should not select questions and
design a validity evaluation to confirm their guiding philosophy. Rather, the validity evaluator
should purposefully design studies to contradict the states‘ beliefs and claims.
While being wary of potential bias, the state can use the guiding philosophy to help
prioritize the multitude of possible evaluation questions. A state that adopts a guiding philosophy
similar to example #1 should certainly prioritize validity questions addressing comparability of
inferences (again, see Abedi, Chapter 8, this volume, for more detail). The evaluator, in this
case, should search for proof of concept cases where well-instructed students do in fact perform
at levels comparable to students participating on the general assessment. The absence of such
cases would be a threat to the guiding philosophy and validity argument found in example #1.
On the other hand, a state subscribing to the philosophy articulated in example #2 would have
to focus on content validity studies to document that the test actually meets the regulatory
Considerations for an AA-MAS
Page 322
requirements of being on grade level. This evaluation should also collect consequential
evidence about students‘ opportunities to learn meaningful grade level content and skills.
Classes and Sources of Evidence
There are many ways to organize and collect evidence for the validity evaluation. The
joint Standards’ (AERA, et al., 1999) five sources of evidence are the most familiar organizing
framework. Earlier work (Marion & Pellegrino, 2006; Marion & Perie, in press) has illustrated
how both the assessment triangle (Pellegrino, et al, 2001) and Ryan‘s (2002) framework could
be used to structure validity evaluations. The joint standards are used as the basis here
because of both their familiarity and straightforward structure. However, the current (1999)
version of the joint standards does not do justice to certain key elements illuminated by the
assessment triangle (Pellegrino, et al., 2001), particularly related to the ―cognition‖ vertex of the
triangle. Further, the 1999 edition of the joint standards does not fully incorporate recent
research making clear the central role of test consequences into validity evaluations (e.g., Lane
& Stone, 2002; Shepard, 1997). Therefore, an introductory section was added to this discussion
to address ―who are the students?‖ and ―how do they acquire proficiency in the domain?‖ to
supplement the joint standards framework. While this type of information should be part of any
validity evaluation, it is even more important in alternate assessment and English language
learner testing contexts where the specific tested population could vary considerably depending
on the selection rules employed. Further, the framework presented here prioritizes the role of
test consequences in the evaluation of AA-MAS validity more than the joint standards would
suggest. Within each of the following categories, the sources of evidence and types of studies
particularly relevant to evaluating the validity of the AA-MAS are described. Several examples
are presented throughout the following sections illustrating how specific propositions and study
designs might differ depending on the specific guiding philosophies and theories of action.
Considerations for an AA-MAS
Page 323
Who Are the Students and How Do They Learn?
As Quenemoen (Chapter 2, this volume) makes clear, identifying students for
participation in the AA-MAS is a complex endeavor. A key eligibility requirement is that students
must be instructed in the grade-level curriculum and have an opportunity to learn grade-level
content (Quenemoen, Chapter 2, this volume). A major premise associated with implementing
an AA-MAS is that students‘ disabilities interact with their capacity to demonstrate what they
know and are able to do and that poor performance is not due to a lack of opportunity to learn.
Karvonen (Chapter 3, this volume) discussed methods for documenting the effectiveness of
instructional and curriculum strategies. This documentation is crucial evidence to help make the
case that students have been appropriately selected to participate in the AA-MAS. Further, IEP
teams need to ensure that appropriate supports and strategies are provided so that students
have the highest likelihood possible to access the grade-level knowledge and skills.
Pellegrino (Chapter 4, this volume) provides a thorough and excellent discussion about
the ways in which students acquire competence in a domain, with a specific focus on
mathematics. Pellegrino‘s exposition is very important for states to keep in mind as they
consider developing an AA-MAS, because if state leaders do not have a sense of how eligible
students will make progress in the domain, then the rationale for and the validity of the AA-MAS
will be suspect. Therefore, a critical aspect of the interpretative argument is the development of
propositions related to the way in which students develop domain competence. The theoretical
conceptions and the associated evidence—such as the results from tasks specifically designed
to measure students‘ progress along a defined learning continuum—should be evaluated as
part of the larger validity investigation for any assessment system, but even more so for the AAMAS because of the field‘s limited understanding of the conceptual underpinnings of this
assessment.
Considerations for an AA-MAS
Page 324
Evidence Based on Test Content
Important validity evidence can be obtained from an analysis of the relationship
between a test’s content and the construct it is intended to measure. Test content
refers to the themes, wording, and format of the test items, tasks, or questions on
a test, as well as the guidelines for procedures regarding administration and
scoring (AERA, et al., 1999, p.11).
One of the foundational principles of the AA-MAS is that it is based upon ―grade-level‖
content. Therefore, collecting and evaluating the evidence regarding comparability of the
content is critically important to evaluating the validity of the AA-MAS. Many states and
evaluators will often use evidence from alignment studies to support claims of content validity.
Well done alignment studies can certainly contribute to content related validity evaluations, but
alignment studies generally focus on matching test items with content-based standards and
objectives. Content-related evidence, especially when one is trying to make claims about ―grade
levelness‖ requires evaluating the interaction of both content and process required of the test
items and, in the case of the AA-MAS, documenting that the interaction is what is expected for
the specific grade level.
In both example theories of action presented earlier, the assessments are based on the
state‘s academic content, but the blueprint described in the second example is based on fewer
and easier items than the general assessment. In this case, the evaluator should critically
evaluate the assertion that the test blueprint used in Example #2 accurately represents the
construct even though a purposeful non-representative item sampling (of grade level content)
approach is used. Even in Example #1, studies should address the assertion that the use of
certain changes (modifications) to the test blueprint (and items) accurately represents gradelevel knowledge and skills as indicated by the content standards.
Considerations for an AA-MAS
Page 325
Evidence Based on Response Processes
Theoretical and empirical analyses of the response processes of test takers can
provide evidence concerning the fit between the construct and the detailed
nature of performance or response actually engaged in by examinees (AERA, et
al., p 12).
Many validity questions emerge from a state‘s belief that implementing an AA-MAS will
better allow students to show what they know can be grouped under response processes.
These types of studies would be applicable for guiding philosophies aligned with either example
#1 or #2, but the orientation would be different depending on the underlying beliefs. Several
studies can be identified where students‘ response behaviors are compared between AA-MAS
and general assessment items with an attempt to attribute the differences to specific theoretical
conceptions outlined in the rationale for the AA-MAS.
Evidence related to the cognition vertex of the assessment triangle can also be
considered within the responses process category. Before analyzing evidence on how students
are responding to specific tasks, it is crucial to describe and analyze which students have been
nominated to participate in the AA-MAS. There is an implicit assumption in both theories of
action that the ―right‖ students are participating in the AA-MAS—an assumption that should be
made explicit in a more complete or elaborate theory of action—but this assumption should be
evaluated before investigating how students are responding to the items. States should have
(and present) a theoretically-grounded rationale as part of the description of the students
participating in the AA-MAS.
Another important dimension of the cognition vertex subsumed by the response process
category is a description of how students acquire competence (proficiency) in the domain. If
there is such a hypothesized progression by which students are expected to develop domain
competence, the evaluator/state should describe how students eligible for the AA-MAS are
expected to follow the same expected progression or how and why they would develop
Considerations for an AA-MAS
Page 326
differently than their same-age peers. The tasks and associated response processes could then
be evaluated against hypothesized learning progressions.
Evidence related to response processes is often collected through the use of cognitive
laboratories (―think-alouds‖) to get a micro look at how students are interacting with the items
and tasks (e.g., Ericsson & Simon, 1980; Johnstone, Bottsford-Miller, & Thompson, 2006). The
data derived from well-designed cognitive laboratories can shed light on students‘ developing
understanding in the grade-level content as a way to ascertain whether the items and tasks on
AA-MAS support this developing understanding.
In the case of the AA-MAS, it will be important to determine whether students interact as
intended with the modified test items and in ways that differ from the non-modified test items.
Students interact with passages and test items on the AA-MAS in ways that allow them to
demonstrate their grade-level knowledge and skills while minimizing construct irrelevant
influences is a proposition that would fit both theories of action. The main difference in how this
assertion might be tested in the two examples would play out in the different passages, items,
and tasks.
Internal Structure
Analyses of the internal structure of a test can indicate the degree to which the
relationships among test items and test components conform to the construct on
which the proposed test score interpretations are based (AERA, et al., p. 13).
For states with a guiding philosophy similar to example #1, this section is critical for
evaluating the validity of the AA-MAS as an important set of evidence in terms of score
comparability. The internal structure of the AA-MAS should be similar to the internal structure of
the general assessment or if not, there should be an explicit reason why the internal structures
of the two assessments differ. As discussed by Abedi (Chapter 8, this volume), meeting strict
comparability criteria (i.e., equating) is generally beyond the reach of almost any AA-MAS
Considerations for an AA-MAS
Page 327
design. Yet, techniques such as confirmatory factor analysis could be used to compare the
internal structure of the general and modified assessments to determine if the structure of the
modified assessment is ―close enough‖ to the general assessment to argue that both are
tapping the same construct. A proposition from the perspective of Example #2 might suggest,
the internal structure of the AA-MAS is generally similar to that of the general assessment, while
one from the perspective of Example #1 would argue for stronger comparability such as, the
same factor structure can be used to explain the variability of the items on both the AA-MAS
and general assessment.
Evidence Based on Relations to Other Variables
Analyses of the relationship of test scores to variables external to the test provide
another important source of validity evidence. External variables may include
measures of some criteria that the test is expected to predict, as well as
relationships to other tests hypothesized to measure the same constructs, and
tests measuring related or different constructs (AERA, et al., p. 13).
This section is probably less critical for the AA-MAS compared with other sources of
evidence, but it can still be important—depending on one‘s theory of action—to substantiate
claims about the AA-MAS. There are other assessments with documented properties (e.g.,
grade level or not; difficult or easy, accessible or not) that should be more or less related to
scores on the AA-MAS. Since psychometricians are quite good at computing correlations, state
leaders and evaluators should articulate the intended relationships a priori instead of data
snooping for relationships that support one‘s conclusions.
Assuming there is an attempt to ensure that the AA-MAS is measuring the same
construct as the general assessment, it is difficult to imagine significant differences in the
relationship to some external test or other variable. If the state had good longitudinal data on
norm-referenced tests or interim assessments, for example, the state might want to put forth a
Considerations for an AA-MAS
Page 328
proposition for Example #2 to gather validity evidence that supports the claim that the AA-MAS
is on grade level. Such proposition might state, the fourth grade AA-MAS is significantly more
related to the fourth grade NRT than it is to the 3rd grade NRT. This proposition could be
extended to argue that the correlations between the AA-MAS and the external criterion should
be very similar to the correlations between the general assessment and the external test.
Evidence Based on Consequences of Testing
There are a host of intended positive consequences associated with a state‘s interest in
implementing an AA-MAS, but there are some serious potentially unintended negative
consequences. As discussed above, states that have an orientation similar to that in example
#2 should focus consequential questions on potential lower expectations that could hinder
eligible students from achieving at grade level. A state‘s approach does not have to be as
extreme as presented in example #2 for the system to carry unintended negative
consequences, therefore state leaders and evaluators need to attend to unintended
consequences related to lower expectations for any AA-MAS. On the other hand, states with a
philosophy similar to example #1 might address consequential issues related to frustration
and/or lack of a meaningful assessment experience from ―unrealistically‖ high expectations. In
any case, consequential studies related to the validity of the AA-MAS need to focus on, in large
part, searching for and evaluating the potential unintended consequences of an AA-MAS such
as lower expectations for students with disabilities.
The AA-MAS was originally conceived as part of the flexibility offered by the U. S.
Department of Education under NCLB and ultimately this assessment has been designed to fit
into states‘ accountability systems and contribute to Adequate Yearly Progress (AYP)
determinations (see Domaleski, Chapter 10, this volume, for more discussion of AA-MAS
accountability issues). The accountability function makes clear that the AA-MAS has been
designed, at least as one purpose, as a policy instrument. As Kane (2006) noted, when
Considerations for an AA-MAS
Page 329
assessments are used to support a particular policy, the consequences of such policy actions
must be incorporated into the validity evaluation. A range of validity evaluation questions and
propositions could be put forth to collect consequential evidence related to the AA-MAS. These
questions and propositions will differ depending on the philosophies and goals guiding the
development and implementation of the AA-MAS. For instance, a proposition to search for
potential unintended negative consequences based on Example #2 might read as follows: the
increase in the percentage of special education students scoring proficient as a result
participating in the AA-MAS has not led to an increase in schools falsely meeting AYP targets
(Type II errors).
Synthesis and Evaluation
Haertel (1999) reinforced the notion that individual pieces of evidence (typically
presented in separate chapters of technical documents) do not make an assessment system
valid or not. The evidence and logic must be synthesized to evaluate the interpretative
argument. As Kane (2006) indicated, the evaluative argument provides the structure for
evaluating the merits of the interpretative argument. Various types of empirical evidence and
logical argument must be integrated and synthesized into an evaluative judgment; this process
can be a challenging intellectual activity. In state assessment programs, when new and varied
information comes in at sometimes unpredictable intervals, the challenge is exacerbated. With
alternate assessment programs, not only is new evidence being collected along the way, but
actual understanding of alternate assessments and the students they serve evolves much more
rapidly than in many other programs. This evolving understanding will require evaluators to
(re)examine evidence in light of these newer understandings.
With the exception of a few states, most AA-MAS are in the very early stages of
development. Therefore, initial syntheses could adopt confirmationist biases during the first few
years of the program until it gets established. This does not mean that long-term studies,
Considerations for an AA-MAS
Page 330
especially consequential, should not be planned and initial data collected, but the synthesis and
evaluation in the early years of the program should focus on substantiating that the
development of the AA-MAS has generally occurred as designed and the designs can be
theoretically supported.
Dynamic Evaluation
In almost all studies that evaluate the validity of state assessment systems, the studies
are completed across a long time span. Evaluators rarely have all the evidence in front of them
to make conclusive judgments. Therefore, evaluators must engage in ongoing, dynamic
evaluations as new evidence is produced. Working in this fashion requires, even more so than
in more predictable evaluations, that each proposition be written to allow judgment of whether
the evidence supports a particular claim. As discussed above, this always means exploring the
efficacy of alternate hypotheses. However, in the context of states‘ large assessment systems,
evaluators do not have the luxury of concluding, ―The system is not working; let‘s start over.‖
Rather, in such instances, when the evidence does not support the claims and intended
inferences, state leaders and test developers must act as if the dynamic results were from a
formative evaluation, and they must search for ways to improve the system. Of course, the
evidence might be so overwhelmingly stacked against the intended claims that the state leaders
are left only with the option of starting over.
The state should use the guiding principles and purposes of the AA-MAS to determine
how to weigh various sources of evidence to arrive at an evaluative judgment. This judgment
could take the form of a summative judgment where a state determines that the overwhelming
evidence suggests abandoning the AA-MAS or going ahead with it full steam ahead. More
likely, however, the state will use the initial validity evaluation in formative ways to improve the
AA-MAS.
Considerations for an AA-MAS
Page 331
References
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1985). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational Measurement (2nd
ed., pp. 443–507). Washington, DC: American Council on Education.
Elliott, S. N., Compton, E., Roach, A. T. (2007). Building validity evidence for scores on a statewide alternate assessment: A contrasting groups, multimethod approach. Educational
Measurement: Issues and Practice, 26(2), 30–43.
Ericsson, K. & Simon, H. (1980). Verbal reports as data. Psychological Review, 87, 215–250.
Gong, B. & Marion, S. F. (2006). Dealing with flexibility in assessments for students with
significant cognitive disabilities (Synthesis Report 60). Minneapolis, MN: University of
Minnesota, National Center for Educational Outcomes.
http://education.umn.edu/nceo/OnlinePubs/Synthesis60.html.
Haertel, E. H. (1999). Validity arguments for high-stakes testing: In search of the evidence.
Educational Measurement: Issues and Practice, 18, 4, 5–9.
Johnstone, C. J., Bottsford-Miller, N. A., & Thompson, S. J. (2006). Using the think-aloud
method (cognitive labs) to evaluate test design for students with disabilities and English
language learners (Technical Report 44). Minneapolis, MN: University of Minnesota,
National Center on Educational Outcomes. Retrieved [today's date], from the World
Wide Web: http://education.umn.edu/NCEO/OnlinePubs/Tech44/
Kane, M.T. (2006). Validation. In R.L. Brennan (Ed.), Educational measurement (4th ed., pp.
17–64). New York: American Council on Education/Macmillan.
Kearns, J., Towles-Reeves, E., Kleinert, H., & Kleinert, J. (2006). Learning Characteristics
Inventory (LCI) Report. National Alternate Assessment Center, Human Development
Institute, University of Kentucky, Lexington.
http://www.naacpartners.org/Products/Files/Research_Focus_LCI.pdf
Kleinert, H., Browder, D., & Towles-Reeves, E. (2005). The assessment triangle and students
with significant cognitive disabilities: Models of student cognition. National Alternate
Assessment Center, Human Development Institute, University of Kentucky, Lexington.
http://www.naacpartners.org/Products/Files/NAAC%20Assmt%20Triangle%20White%20
Paper%20-%20FINAL%20for%20Website.pdf
Lane, S., & Stone, C. A. (2002). Strategies of examining the consequences of assessment and
accountability programs. Educational Measurement: Issues and Practice, 21(2), 23-30.
Considerations for an AA-MAS
Page 332
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance-based assessment:
Expectations and validation criteria. Educational Researcher, 20, 8, 15–21.
Marion, S.F., & Pellegrino, J.W. (2006). A validity framework for evaluating the technical quality
of alternate assessments. Educational Measurement: Issues and Practice, 25(4), 47–57.
Marion, S. F. & Perie, M. (2009). Validity arguments for alternate assessments. In Schafer, W.
and Lissitz, R. (eds.) Alternate assessments based on alternate achievement standards:
Policy, practice, and potential (pp. 115-127).
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13–
103). New York: American Council on Education, Macmillan Publishing.
Messick, S. (1995). The interplay of evidence and consequences in the validation of
performance assessments. Educational Researcher, 23, 2, 13-23.
No Child Left Behind Act of 2001, Pub. L. No.107-110, 115 Stat.1425 (2002).
Pellegrino, J. W., Chudowsky, N. J., & Glaser, R. (Eds.) (2001). Knowing what students know:
The science and design of educational assessment. Washington, DC: National
Academy of Sciences.
Rabinowitz , S. & Sato, E. (2005). The technical adequacy of assessments for alternate student
populations. San Francisco: WestEd.
Ryan, K. (2002). Assessment validation in the context of high stakes assessments. Educational
Measurement: Issues and Practice, 21(1), 7–15.
Schafer, W. D. (2005). Technical documentation for alternate assessments. Practical
Assessment Research & Evaluation, 10(10).[Available online:
http://pareonline.net/getvn.asp?v=10&n=10].
Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of
Research in Education, 19, 405–450.
Shepard, L. (1997). The centrality of test use and consequences for test validity. Educational
Measurement: Issues and Practice. 16(2), 5–24.
Considerations for an AA-MAS
Page 333
CHAPTER 10
OPERATIONAL AND ACCOUNTABILITY ISSUES
Chris Domaleski
A full examination of the issues and elements related to the design and adoption of a
new state assessment program would not be complete without careful consideration of the
context in which the program will be situated. It is important to acknowledge that an alternate
assessment based on modified achievement standards (AA-MAS) would exist as part of a larger
state assessment and accountability system. Therefore, it is essential to understand the
interrelationship of the AA-MAS with other assessment programs. Moreover, the potential
impact of the AA-MAS on the state accountability system should be carefully explored.
This chapter begins with an overview of the background and context for accountability
and then summarizes the key provisions in the United States Department of Education‘s
(USED) regulations that pertain to accountability determinations. This is followed by a
discussion of the relationship of the AA-MAS to existing New York State Education Department
(NYSED) assessments. Subsequently, specific accountability issues are addressed to include
procedures to estimate reliability, and a review of key operational considerations. The chapter
ends with a discussion of factors related to student and summary reporting, and a consideration
of issues and options related to diploma eligibility.
In exploring these topics, focus will be placed on practical, technical, and policy
elements. By so doing, the goal will be to highlight options and provide guidance to assist with
implementation and evaluation.
Background and Context for Accountability
Education accountability systems, in some form or another, have been in place for at
least the previous three decades. However, earlier accountability systems tended to focus on
areas such as regulation compliance and financial management (Fuhrman, 2004). The change
Considerations for an AA-MAS
Page 334
in focus to outputs, chiefly, student performance on standardized assessments, began in
earnest in the 1980s. During this time, accountability approaches drawn from business
applications gained support from education policy makers (Fuhrman, 2004). This was bolstered
by a wave of concern about the perceived decline in quality of education as described in the
influential publication A Nation at Risk (1983). In subsequent years, accountability systems
expanded and focused more on student and school performance.
Another major influence on contemporary education accountability began in the 1990s
with increased support for standards-based reform. The guiding idea behind this approach is
that expectations for what students know and can do should be clearly established, which will
guide all other elements of the educational system, chiefly instruction and assessment (O‘Day &
Smith, 1993). Advocates argue that such an approach leads to a number of improvements such
as clarifying goals, incentivizing improvement, and informing allocation of resources (DarlingHammond, 2006). This perspective was a guiding factor behind the development and
implementation of accountability systems in the 1990s and in the current decade, including the
federal No Child Left Behind (2001) legislation.
As support increased for standards-based reform, so too did advocacy for students with
disabilities. Historically, many educators and stakeholders did not provide students with
disabilities access to the general curriculum. With recent reauthorizations of IDEA and NCLB,
the view that students should be taught and held accountable for grade-level standards
prevailed. This position has not been without opposition, from those that argued that such goals
are unreasonable and/or traditional standardized assessment practices are ill-suited for
students with disabilities.
Today, a central idea behind contemporary accountability practices is the inclusion of all
students, including students with disabilities. This is based on the belief that measuring,
reporting, and holding schools explicitly accountable for the performance of students with
disabilities is critical to ensuring that educators attend to their needs, provide appropriate
Considerations for an AA-MAS
Page 335
resources, and set high expectations for learning. The extent to which this principle holds rests
largely on the integrity of the measures used to gauge student achievement. This is the context
that has inspired the state of New York, like many other states, to explore the efficacy of
customizing a standards-based assessment for a portion of the population of students with
disabilities.
Federal Regulations
Against this backdrop, the United States Department of Education (USED) issued
regulations and guidance in April of 2007 that addressed the implementation of modified
academic achievement standards and assessments. These regulations were explicitly targeted
to a small group of students whose disability precludes them from achieving grade-level
proficiency within the year. A more complete overview of the regulations is presented in Chapter
1 (Perie, this volume). The focus of this section will be to review the elements that directly
impact accountability determinations.
In terms of accountability, there are two main elements of the policy that merit attention.
First, the regulations and guidelines establish that states may count as proficient for the purpose
of AYP calculations, the proficient and advanced scores of students with disabilities based on
an AA-MAS, provided the number of these scores do not exceed 2% of all students in the
grades assessed in language arts and mathematics. In other words, scores on the AA-MAS can
be used in adequate yearly progress (AYP) calculations in the same way as scores from the
general assessments within the 2% cap. While this seems straightforward, there are a number
of caveats and considerations that warrant further examination to fully appreciate the application
of this stricture. This will be addressed in a later section of this chapter.
The second major element of the policy with respect to accountability is the expiration of
the ‗interim-flexibility‘ policy. Interim-flexibility refers to the practice of allowing states that meet
certain criteria to count as proficient for purposes of AYP a portion of the students with
Considerations for an AA-MAS
Page 336
disabilities. This applies at the school or district level if AYP is missed solely because of the
achievement of the students with disabilities subgroup. The portion is determined by dividing 2
percent by the percent of students with disabilities in the state. For example, in the State of New
York the SWD subgroup is about 12 percent of the student population. Dividing 2 by 12 equals
17 percent. Therefore, 17 percent of the state‘s SWD population could be counted as proficient
for purposes of AYP, where applicable.
The purpose of this flexibility was to forestall the impact of non-proficient classifications
based on general assessments for students who may be candidates for an AA-MAS, during the
time that new assessments more appropriate for this population, were under development.
Importantly, the interim-flexibility, which was initially granted for the 2004–05 academic year is
extended through 2008–09 in the regulations; however, it expires beginning in the 2009–10
academic year. Whether an AA-MAS is developed or not, this will have an impact on
accountability determinations in the state of New York, which, like many states, has applied the
interim-flexibility in AYP computations. The interim-flexibility essentially allows states to count
the maximum percent of eligible students proficient in AYP computations. Consequently, when
this expires, states will likely see an increase in the number of SWD groups that fail to meet
their annual measurable objectives (AMOs).
Relationship to Existing Assessments
The state of New York has developed a comprehensive assessment system to measure
student achievement of the New York State Learning Standards and to satisfy the accountability
provisions of No Child Left Behind (NCLB). Assessments used in the accountability system are
part of the New York State Testing Program and include English/ Language Arts (ELA) in
grades 3-8, mathematics in grades 3-8, and science in grades 4 and 8. At the secondary level
the Regents English Comprehensive Exam and the Regents Integrated Algebra Exam are used
Considerations for an AA-MAS
Page 337
for AYP purposes. Moreover, the department has developed the New York State Alternate
Assessment (NYSAA) for students with significant cognitive disabilities.
The NYSED follows a development and validation process in keeping with professional
standards, partnering with assessment specialists, contractors, educators, and stakeholders.
Each item on the assessment is mapped to a performance indicator that is consistent with the
state curriculum. The elementary and intermediate ELA assessments consist of multiple-choice
and short and/or extended-response items. Some assessments also include an editing
paragraph. The Regents Comprehensive English Exam contains multiple-choice items based on
passages and stimuli, including a listening portion, as well as a constructed-response writing
prompt. The 3–8 and Integrated Algebra assessments also include multiple-choice and
constructed-response items, requiring students to generate item responses and show work.
While the Regents Examinations are given at various times throughout the year, the 3–8
assessments are typically administered in early spring term. Students take the exams in
sections or books over two to three days.
The NYSAA is a datafolio assessment that measures the achievement of students with
significant cognitive disabilities. The datafolio is a collection of evidence in response to aligned
tasks, evaluated with respect to accuracy and independence, intended to provide information
about the student‘s achievement. NYSAA tasks are aligned to Alternate Grade Level Indicators
(AGLIs) which are entry points to grade-level expectations in the New York State learning
standards.
Grades and Content Areas for AA-MAS
An important decision for the NYSED is the determination of the grades and content
areas in which to implement an AA-MAS. There is no regulatory requirement to develop or
adopt an AA-MAS, so the potential implementation options range from none to all. That is, the
NYSED may decide not to proceed with an AA-MAS in any area or to pursue full adoption in all
Considerations for an AA-MAS
Page 338
grades and content areas assessed, regardless of inclusion in NCLB accountability. Naturally, a
number of implementation options in between these two extremes are available as well.
A decision about scope of development is, foremost, a policy decision that should be
guided by the goals of the NYSED and the purpose for considering an AA-MAS. Assuming it is
desirable to implement an AA-MAS as broadly as possible, there are at least three possible
perspectives that might guide prioritization of implementation.
First, the extent to which the general assessments are seen as valid and appropriate for
students with disabilities could be a guiding principle. By carefully evaluating both the
assessment characteristics and student performance, the state might develop priorities for the
grades and content areas that should be given primary consideration. For example, one may
wish to review blueprints and specifications for the general assessments to determine which are
relatively more cognitively complex and/or rigorous. Moreover, one may wish to review the gap
between performance of students with disabilities and general education students, and focus on
the assessments that have the largest gap. When these two approaches identify the same
assessments, a more compelling case for prioritizing these assessments may be made.
Additionally, there may be legal issues to consider. If a general assessment is regarded
as not suitable for students with disabilities, the state may be legally compelled to pursue the
development of an alternate assessment. This position was supported by Chapman v. California
Department of Education (2002) in which a federal court ruled that the state of California must
provide an alternate assessment if it is determined that students with disabilities are unable to
access the general assessment due to their disability.
A second approach may be to allow the consequences or stakes associated with the
assessment to guide prioritization of the grades and content areas in which an AA-MAS should
be developed. Using this orientation, those areas covered in the state accountability system
(ELA and mathematics in 3-8 and high school) may be given higher priority. There may be other
stakes, either currently in place or planned, that could guide this decision. These may include
Considerations for an AA-MAS
Page 339
student stakes, such as diploma eligibility, or rewards/consequences at the teacher, school, or
system level.
A third lens through which to view this decision is related to practical or operational
constraints. Unavoidably, the availability of resources, such as cost and staff capacity, has a
significant impact on options that can be considered. Such factors as the format of the
assessment, the frequency of administration, or the scope of ongoing development and support,
may make some options more feasible than others.
It is important to acknowledge that these three approaches are not mutually exclusive
and most likely will interact with each other. For example, assuming resources are limited, the
NYSED may get a sense for the scope of implementation which may narrow options down to a
specific program or grade span. Thereafter, it may be reasonable to consider the policy
implications, then, review the properties and performance of the assessments to further identify
the area in which to begin implementation. Another important consideration is whether or not the
state would like to use scores on the general assessment to inform placement on the AA-MAS.
If so, then it will be important to introduce the AA-MAS at a later grade to acquire score(s) on
one or more years of the general assessment.
The state of New York may approach this decision as a cost-benefit analysis. The costs
of implementing an AA-MAS are related to finances, operational burden to state and local staff,
and possible forfeiture of other programs and initiatives that could be supported by these
resources. On the other hand, the benefits may include improved information from assessment
and accountability systems and the ability to promote student achievement for students with
disabilities.
Although each state likely differs with respect to a number of the factors previously
examined, it may be useful to examine the scope of AA-MAS implementation in other states. In
2007 the National Center on Educational Outcomes (NCEO) reviewed the characteristics of the
AA-MAS for six states, including the grades and content areas that were addressed. The results
Considerations for an AA-MAS
Page 340
are presented in table 10-1 below, reproduced from that report. The results show that most
states implemented the AA-MAS fairly broadly. Each state included reading and mathematics at
the elementary level and all but one state (North Carolina) offered an AA-MAS in these areas
and at the secondary level. Many states also implemented the assessment in areas not included
in the NCLB accountability system, such as Kansas which developed an AA-MAS for writing
and social studies. It is important to reiterate, however, that each state‘s decision is connected
to a unique set of policies and priorities. There is not a uniform or best solution for all states.
Table 10-1 AA-MAS Name, Content Areas, and Grade by State
State
Assessment Name
Content Areas/ Grades
Kansas
KAMM (Kansas Assessment of
Multiple Measures)
Reading ( 3-8; once in HS); Math ( 3-8; once
in HS); Writing (5,8, once in HS); History/Gov
(6, 8, once in HS); Science (4,7, once in HS)
Louisiana
LAA2 (LEAP Alternate
Assessment, Level 2)
English (Grades 4-10); Math (Grades 4-10);
Science (Grades 4, 8 and 11); Social Studies
(Grades 4,8,11)
Maryland
Mod-MSA (Modified Maryland
School Assessment) and ModHSA (Modified High School
Assessment)
Reading/ELA ( 3-8, HS); Mathematics ( 3-8,
HS)
North Carolina
NCEXTEND
Reading (Grades 3-8); Math (Grades 3-8);
Science (Grades 5 and 8)
North Dakota
North Dakota Alternate
Assessment Aligned to North
Dakota Content Standards for
Students with Persistent Cognitive
Disabilities
Reading (3-8,11); Math (3-8,11); Science
(4,8,11 )
Oklahoma
CARG-M (CARG=Curriculum
Access Resource Guide)
ELA/Reading (Grades 3-8, HS); Math
(Grades 3-8, HS); Science (Grades 5 and 8)
Table reproduced from Lazarus, S. S., Thurlow, M. L., Christensen, L. L., & Cormier, D. (2007). States’
alternate assessments based on modified achievement standards (AA-MAS) in 2007 (Synthesis Report
67). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Participation Options and Evidence
As addressed in Chapter 1 of this volume, there are five alternatives for assessment
participation. These are: 1) participation in the general grade-level assessment; 2) participation
in the general grade-level assessment with accommodations; 3) participation in an alternate
Considerations for an AA-MAS
Page 341
assessment based on modified academic achievement standards; 4) participation in an
alternate assessment based on alternate achievement standards; and 5) participation in an
alternate assessment based on grade-level academic achievement standards (AA-GLAS). The
fifth option differs from the AA-MAS in that the AA-GLAS performance expectations must be
directly related to those on the general assessment.
The regulations further stipulate that a state must establish participation criteria for IEP
teams based on evidence that the student‘s disability has precluded the student from achieving
grade-level proficiency and the student‘s progress suggests the student will not reach gradelevel proficiency during the academic year. Therefore, a key issue with the implementation of an
AA-MAS will be the development of guidelines to inform participation decisions and the
collection of evidence that meets the criteria described. In previous chapters more detailed
information was provided about the guiding perspectives and approaches to identify the
population of students that are appropriate for the AA-MAS. In this section, the focus is on the
specific, objective data sources and methods that may be considered to inform these decisions.
One approach is to analyze extant assessment data from interim, formative, or
summative state assessments or other commercially available standardized assessments. The
advantage of using state curriculum-based assessments is that the performance level provides
direct evidence of student performance with respect to grade-level expectations. Eligibility
criteria may be related to persistent low performance (e.g. failure to achieve proficiency in more
than one administration) and/or performance that is well below standard (e.g. performance level
one.) This approach is bolstered if the state can produce evidence that the probability of
achieving on grade level on the general assessment in the current year is low given
performance the previous year. For example, if the criterion selected was level one (e.g., Below
Basic) performance on the summative state assessment and only a very small percentage of
students scoring at level one go on to score at or above level three in the following year, this
signals that the expectation is reasonable. Other commercially available standardized
Considerations for an AA-MAS
Page 342
assessments, such as a norm-referenced assessments, may also be candidates for evidence.
For example, regression analyses may be employed to produce a predicted score on the state
curriculum assessment for various NRT score values. These data can be analyzed to determine
a suitable eligibility criterion that indicates that students below the standard are unlikely to
perform on grade level.
Another category of evidence to consider is related to the student characteristics. For
example, the state may review performance for students based on disability category to
determine which are associated with persistent low performance. The Georgia Department of
Education conducted one such study that explored many factors including disability type and
revealed that students with mild intellectual disabilities were disproportionately represented
(Fincher, 2007). While disability category may not be used as a criterion for participation, such
analyses can provide information to better identify the group of students who might benefit from
participation in an AA-MAS or to evaluate the extent to which schools and systems are making
appropriate participation decisions. These analyses involve two basic elements. First, select a
condition that identifies students who are consistently below grade level (e.g. below level 3 on
state assessment performance in consecutive years). Second, explore these data for patterns
that may provide more information about the group. For example, are there strands or domains
within content areas where performance is particularly low? Are students who received certain
accommodations disproportionally represented compared to the state as a whole?
It is noteworthy that Georgia‘s study identified many persistently low-performing students
who do not receive special education services. This invites serious consideration as to why
these students are not meeting academic achievement standards and to what extent these
same factors are applicable to students with disabilities. At least part of the answer is likely to be
that instructional approaches and supports for these students have been ineffective.
For this reason, evidence should be collected to document the extent to which students
received instruction aligned with the curriculum at the appropriate grade level. Moreover, what
Considerations for an AA-MAS
Page 343
supports and interventions have been in place to promote achievement? In reviewing this
information, it is worthwhile to consider how approaches are similar or different for lowperforming students with disabilities compared to other similarly performing students. This
information may help policymakers disentangle which students are the most prominent
candidates for an AA-MAS and which students (both with and without disabilities) may benefit
from improved instruction and support strategies.
Another important aspect related to participation options is the establishment of
guidelines for students to transition from the AA-MAS to the general assessment. It is possible
that students may take the AA-MAS in all content areas or take the AA-MAS in selected content
areas and the general assessment in others. Given that placement decisions need to be made
annually, guidelines for transition should be developed that are informed by appropriate
evidence.
One way to accomplish this goal is to establish a policy based on a specific score on the
AA-MAS. For example, students scoring at the advanced performance level may automatically
move out of the AA-MAS to the general assessment the following year. In Chapter 8 (Abedi, this
volume), the topic of establishing comparability between the assessments is explored. The
extent to which there is an explicit, quantifiable relationship between the assessments using the
techniques discussed will guide the decision. Such evidence should indicate that the AA-MAS
can produce a grade-level achievement indicator that is explicitly and demonstrably comparable
to proficiency on the general assessment. This should be based on the extent to which both the
content and performance expectations are comparable. Examples of evidence might include:
comparison of the distribution of content standards addressed, including cognitive complexity,
between the general assessment and the AA-MAS at the ‗exit‘ standard; performance level
descriptors for the comparable achievement levels are designed to closely match; and/or a
review of performance data shows that a reasonable number of students who exit the AA-MAS
subsequently achieve proficiency on the general assessment.
Considerations for an AA-MAS
Page 344
The use of multiple indicators will strengthen such decisions. For example, a profile
approach could be implemented that takes advantage of several data sources. Such an
approach may involve establishing a number of categories that indicate various conditions
under which eligibility to exit may be supported. Examples of such profiles might include 1)
scoring at the advanced level on the AA-MAS; 2) scoring between levels 2 and 3 while also
achieving a criterion score on a district assessment; 3) achieving a specific level of course
performance in tandem with AA-MAS and/or local assessment scores; 4) recommendation from
IEP committee etc. These examples are intended to be illustrative and each profile should be
carefully developed and monitored to ensure they are reasonable and appropriate.
Accountability System Background
In addition to considering the role of the AA-MAS in the general assessment system, it is
also important to consider how adoption of such an assessment will fit into the NCLB
accountability system. New York State‘s NCLB accountability system is authorized by 8 NYCRR
§100.2 which states in part, ―Each year…the commissioner shall review the performance of all
public schools, charter schools and school districts in the State. For each accountability
performance criterion specified…the commissioner, commencing with 2002–2003 school year
test administration results, shall determine whether each public school, charter school and
school district has achieved adequate yearly progress.‖ The code provides a full description of
the system, including how AYP is determined and schools are designed as requiring academic
progress.
As described in 8 NYCRR §100.2 and consistent with federal requirements, New York
State‘s accountability system is comprised of three main elements: 1) participation rate; 2)
academic achievement; and 3) an additional indicator. The participation criterion requires that
95% of students in all applicable subgroups take part in state assessments annually. Academic
achievement is measured by yearly performance on state curriculum assessments in ELA and
Considerations for an AA-MAS
Page 345
mathematics for grades 3–8 and high school. This is operationalized by a performance index
system that is evaluated with respect to effective Annual Measurable Objectives (AMOs); these
will be discussed in more detail later in this section. Finally, performance on science
assessments or attendance serves as the additional indicator in grades 3–8 and graduation rate
is the additional indicator for high schools. Meeting the overall AYP standard for schools and
LEAs is based on all subgroups meeting all criteria. That is, the criteria are considered
conjunctively—if any group fails to meet the standard the school does not make AYP.
An essential component of any examination of accountability practices is to clarify the
purpose of the system and the underlying theory of action. Because New York State‘s system
was designed to be compliant with federal regulations, language from the 2001 NCLB Act (20
U.S.C. § 6301) may serve as a guiding statement of purpose, ―to ensure that all children have a
fair, equal, and significant opportunity to obtain a high-quality education and reach, at a
minimum, proficiency on challenging state academic achievement standards and state
academic assessments.‖ Based on this idea of promoting equity and achievement, a theory of
action can be shaped for the accountability system. Drawing on a simplified conceptualization
proposed by Marion et al. (2002), the essence of such a theory involves the following: 1)
accountability policy provides incentives, such as recognition or sanctions; 2) awareness and
expectations regarding school performance are heightened; 3) educators and students benefit
from resources and development; 4) these factors contribute to an improvement in student
achievement.
Against this backdrop, the impact of introducing an AA-MAS into the New York State
accountability can be more appropriately assessed. In the best case, the AA-MAS should
provide more trustworthy information about student performance to better guide accountability
determinations and allocations of resources. As outlined in Chapter 2 (Quenemoen, this
volume) this theory is connected to a guiding philosophy that values improved student
outcomes, and promotes systems and structures that effectively and consistently support this
Considerations for an AA-MAS
Page 346
objective. To the extent that this occurs, the validity and reliability of accountability
determinations should be augmented.
The validity of the accountability system is strongly tied to the design of the system, as
well as the intended use of results. The central validity focus is ensuring that the assessments
used in the model are trustworthy for classifying selected students with disabilities as proficient
or not proficient, which was addressed in Chapter 9 (Marion, this volume). If the assessments
provide better information than the general assessments, the validity of the accountability model
should be improved. However, if the AA-MAS is poorly suited for this purpose (e.g. is used to
lower expectations for students with disabilities rather than provide accurate information with
respect to achievement) then the validity of the accountability model is threatened. However, it
is assumed that the structure and purpose of the accountability system would remain intact if an
AA-MAS were introduced. For this reason, the primary focus will be evaluating the extent to
which the system continues to function as it is currently designed in a stable and consistent
manner. This is primarily an issue of reliability, which will be the focus in this chapter.
Evaluating the Reliability of Accountability Determinations
There are two primary sources of error that impact the reliability of accountability
systems: measurement error and sampling error. Measurement error refers to the extent to
which individual assessments in the accountability system produce stable and consistent
results. This is influenced by variability in the population of students who take a specific
administration of the test. Sampling error, on the other hand, refers to variations in the school
population from year to year.
The literature related to evaluating measurement error or reliability is fairly well
established. Reliability can be defined in practical terms as the degree to which an examinee‘s
performance on a test is consistent over repeated administrations of the same or alternate
forms (Crocker and Algina, 1986.) It is possible to evaluate test score reliability using a number
Considerations for an AA-MAS
Page 347
of approaches to include those based in item response theory, generalizability theory, or classic
test theory. Drawing from the latter category, ‗test-retest‘ and/or parallel form methods are wellknown. As the name implies, test-retest approaches involve administering the same
assessment to a group of examinees on two or more occasions. The correlation of scores yields
an indication of the stability of the measure. Alternately, one can administer forms designed to
be parallel to a group of examinees to produce a measure of equivalence. A more robust
approach involves combining the two methods by administering different (equivalent)
assessments to the same group of examinees at two or more points in time to yield an
indication of stability and equivalence. Because this approach is influenced by error related to
time and form differences, a strong correlation bolsters evidence for reliability. Still another
approach, used more commonly, is to calculate reliability based on internal consistency. This
method is attractive due to the practical advantages of obtaining a reliability measure based on
a single administration of a single form. There are a number of methods to implement this, but
perhaps the most familiar is Cronbach‘s coefficient alpha. Finally, there is a family of methods
based on inter-rater reliability, suitable for assessments involving responses or evidence that
must be evaluated by a human rater.
A full discussion of how to operationalize each of these and other approaches to quantify
the reliability of an assessment is beyond the scope of this document. The reader is referred to
seminal works such Crocker and Algina (1986) and Haertel (2006). Moreover, the Standards for
Educational and Psychological Testing (AERA, APA, & NCME,1999) provides the ‗industry
standard‘ conventions for evaluating and reporting the measurement error associated with test
scores.
It is important to acknowledge that the appropriate method for calculating reliability may
differ depending on the approach that is selected for the AA-MAS. Moreover, many researchers
stress the need for new and flexible approaches that are designed to ‗fit‘ the assessment. Gong
and Marion (2006) assert, ―evaluating the technical quality of alternate assessment systems
Considerations for an AA-MAS
Page 348
requires drawing on existing psychometric and evaluation techniques as well as modifying
existing approaches or inventing new ones.‖ This could include any number of procedures
designed to quantify the precision of scores under various conditions, the consistency of raters,
and/or the integrity of the scoring process.
The second factor related to reliability of accountability determinations is sampling error.
In fact, Hill and DePascale (2002) emphasize that sampling error, ―contributes far more to the
volatility of school scores than does measurement error.‖ Sampling error refers to fluctuations in
school scores that can be unrelated to actual school performance. For example, a school may
receive a more favorable accountability determination compared to the previous year, because
the students enrolled were inherently higher performing, and not because the quality of
instruction improved. Naturally, sampling error can work to both advantage or disadvantage
reported accountability determinations.
Hill and DePascale (2002) present four approaches to evaluate sampling error by
estimating the precision or consistency of accountability classifications. The most
straightforward method is termed split-half and simply involves dividing the data for each school
into randomly equivalent halves and calculating the percentage of times the same decision is
made for each half. Another method involves taking random draws with replacement by
repeatedly producing random samples from the schools to evaluate decision consistency. A
Monte Carlo approach can also be implemented, which involves simulating the distribution of
scores and creating randomly generated samples from which classification consistency can be
evaluated. Finally, direct computation, involves calculating exact probabilities for correct
classification by determining the distribution of errors. For an extended treatment on these
methods including details on operationalization, the reader is referred to Determining the
reliability of school scores (Hill & DePascale, 2002).
Arce-Ferrer, Frisbie, and Kolen (2002) also examined the effect of sampling error on
year-to-year changes in achievement expressed as proportions (e.g. percent proficient). They
Considerations for an AA-MAS
Page 349
found that about two thirds of the variability in estimates were related to sampling error and
about one third could be broadly attributed to intervention effects, systematic errors,
measurement errors, and equating error. The authors evaluated error by comparing observed
variability in proportions with expected variability for one and two year changes at different
performance ranges and group sizes. Expected variability was determined by calculating the
error variance of the difference between proportions under a binomial model. These methods
could also be applied to study changes in New York State‘s accountability determinations.
Perhaps no factor impacts sampling error or classification consistency of an
accountability system more than sample size. Simply stated, larger subgroups produce more
stable and consistent results. As a matter of practice, confidence intervals are often used in
accountability systems to both gauge and mitigate the effects of sampling error due to sample
size. Confidence intervals are constructed by: 1) determining the standard error for a proportion,
where the proportion is the target percent proficient or AMO; 2) multiplying this by a desired
level of precision corresponding to a distribution value (e.g. z score); and 3) subtracting this
figure from the target value to achieve a range of performance within which values are regarded
as not significantly different.
The state of New York incorporates confidences intervals in the accountability system
through effective AMOs. Effective AMOs are designed to integrate confidence intervals with the
Performance Index (PI) in a straightforward manner. To accomplish this, the NYSED has
produced tables that indicate for various group sizes, the smallest observed PI that is not
statistically different from the AMO (i.e. within the confidence interval.) New York State uses a
90% confidence interval and a minimum n of 30 for academic achievement.
Operational Considerations for New York State’s Accountability System
New York State‘s accountability system, like those of other states, may be said to be
indifferent to the source of proficiency. In other words, the system is designed such that
Considerations for an AA-MAS
Page 350
whatever instrument or process is used to determine a student‘s performance level, the ‗gears‘
of the system should function to produce an accountability outcome without disruption.
Presently, performance levels are input from the NYSTP assessments in grades 3–8, Regents
Examinations in high school, and the NYSAA for students with significant cognitive disabilities.
Each of these assessments classifies a student into one of four performance levels which are
incorporated into the system.
The New York State Accountability model does have a unique feature, however, that
governs how proficiency determinations are produced. In lieu of percent proficient measures,
New York State uses a Performance Index (PI). The PI system involves computing a ratio such
that the students scoring at levels 2, 3, 4 and those scoring at levels 3 and 4 only are divided by
all continuously enrolled students. This figure is multiplied by 100 to produce the index. For
example, if a school has 200 students and 40 of them scored at level 1, 80 at level 2, 60 at level
3, and 20 at level 4 the index would be calculated as: ((80+60+20+60+20)/200)x100 which is
120. The index can range from 0, if all students are at level 1, to 200, if all students are at level
3 or higher. This approach incentivizes student improvement below proficiency by providing a
boost to the index value when a student progresses from level 1 to level 2.
One straightforward approach to incorporating the AA-MAS in the system would be to
establish four achievement levels corresponding to those of the existing AYP assessments. By
so doing, performance from the AA-MAS can be included in the PI in the same manner.
However, design decisions may restrict this possibility. For example, if the assessment is
determined to produce limited information such that only three levels can be produced,
alternatives for adjusting the PI will need to be considered. This might involve eliminating an
advanced designation, which should have no computational impact on the index, or eliminating
the basic proficient level (i.e. treat levels 2 and 3 like levels 3 and 4 in the PI) in which case the
‗partial-credit‘ advantage of the PI would be reduced. (Understanding, of course, that the real
impact is more connected to the rigor of the standard than the nomenclature of the standard.)
Considerations for an AA-MAS
Page 351
Another operational issue to consider is managing the 2% cap. As previously indicated,
the 2% cap refers to the upper limit on the number of proficient and advanced scores that a
state or district can count toward proficiency in AYP from the AA-MAS; it does not restrict the
number or percent of students who may participate in the assessment. The state or system may
only exceed the 2% proficiency cap if the percent of students assessed on the NYSAA is below
1%. In this manner the 2% can be thought of as a ―soft cap‖ where the 1% is a ―hard cap‖. That
is, the 2% may be exceeded as long as it does not extend beyond the margin the state or
system has under the 1% for the AA-AAS. For example, if 0.7% of all students in New York
State‘s accountability system are counted as proficient on the NYSAA, then as high as 2.3% of
students in the accountability system can be counted as proficient on the AA-MAS.
USED policy further specifies that all proficient scores from an AA-MAS that exceed the
2% limit, must be counted as non-proficient in AYP calculations. These scores must be counted
as non-proficient for the state, system, school and for each subgroup in which the student is a
member. This compels the state to determine which scores will be deemed non-proficient—a
process referred to as ‗redistribution.‘
In guidance, USED refers to a paper by Martinez and Olsen (2004) which describes four
methods to implement redistribution. The first approach is to randomly assign non-proficient
scores back to schools where any students tested on the AA-MAS. A second method is termed
proportional. This involves assigning non-proficient scores back to schools corresponding to
either the proportion of tested students or the proportion of proficient students at the school. A
strategic approach is also described, which involves making decisions for each school that
maximize the chance that the school will make AYP (e.g. assigning non-proficient scores back
to groups that exceeded AMOs such that the outcome is unchanged.) Finally, the authors
propose a pre-determined school cap approach. This involves determining a limit or formula for
each school based on the expected number of student participating in an AA-MAS.
Considerations for an AA-MAS
Page 352
The decision of which approach to implement should be measured against the
department‘s priorities and the inherent advantages and risks of each. For example, the
strategic approach may seem attractive because it will likely produce the fewest number of
schools not making AYP. However, not only would this method be difficult to implement in an
unbiased manner, it may also enable potentially inappropriate AA-MAS participation practices.
The random and proportional methods seem straightforward to implement, however these may
penalize sound participation practices and do nothing to account for a school that serves a large
number or percentage of students with disabilities. An adapted or hybrid pre-determined method
may be the most promising approach. This method would require the state of New York to
carefully establish the expected participation rate in the AA-MAS for schools and systems,
perhaps based on previous enrollment or assessment practices. Then, the state would apply
additional scrutiny to the schools that deviated from expectation by the largest margin. Schools
that deviated for defensible reasons would be protected, but others may be required to adjust a
selected number of proficient scores.
An additional consideration for the state of New York is to decide which scores should
be redistributed and how they should be reassigned within the performance index system.
Because level 3 and 4 scores are always fully proficient in the index and level 1 scores are
always non-proficient, the primary concern is level 2 scores. Essentially, the index treats these
as ‗partially‘ proficient. That is, the current value produced by the index is the midpoint between
the values that would have been produced if either all level 2 scores were treated as nonproficient or all level 2 scores were treated as fully proficient. For that reason, it seems
appropriate to regard these values as one-half (.5) proficient for purposes of redistribution. In
this manner, districts could assign the designated number of level 3 or level 4 scores to level 1
or twice as many of these scores to level 2. For example, if a district had to redistribute 10
proficient scores to non-proficient scores they could either select 10 level 3 scores and make
them level 1 scores, or they could select 20 level 3 scores and make them level 2 scores.
Considerations for an AA-MAS
Page 353
Similarly, the district could select 20 level 2 scores and make them level one scores.
Mathematically, it is inconsequential, as each approach produces the same PI value.
Evaluating New York State’s Accountability Determinations
Earlier it was mentioned that the accountability system is in many ways indifferent to the
proficiency input. This is intended to convey that from an operational perspective incorporating
results from an AA-MAS into New York State‘s model is, with some exception, straightforward.
However, this is not to suggest that that the accountability output is unaffected by the
introduction of an AA-MAS. Indeed, a central question remains: how will mixing the results from
three tests into a single accountability outcome affect results?
Addressing this question will require some purposeful analyses to understand the
impact. A good starting point would be to explore the distribution of students who may be
‗candidates‘ to take the AA-MAS throughout the state. The information in Chapter 2
(Quenemoen, this volume) may be helpful in indentifying the characteristics of interest — such
as students with certain disability types or those who persistently perform at the lowest
performance level on general state assessments. Using this information, it will be beneficial to
determine if the students are distributed uniformly (i.e., most schools enroll a similar
percentage) or if the students are clustered in certain districts or schools (i.e., some enroll a
high percentage while others enroll few to none). Moreover, are potential AA-MAS students over
represented in other subgroups (e.g. racial ethnic groups, economically disadvantaged, etc.)?
Previous research suggests that an expected finding will be that candidates for an AA-MAS are
disproportionately distributed in systems, schools, and subgroups. This is likely to have the most
impact on accountability determinations for those units or subgroups with the highest
representation.
A second category of analyses involves exploring the pattern of accountability
determinations for subgroups and schools. This can be accomplished prior to implementing an
Considerations for an AA-MAS
Page 354
AA-MAS by modeling or simulating a hypothesized statewide AYP outcome. One approach to
implementing this would be to conjecture that the students who scored in the lowest 2% on the
general assessments will take the AA-MAS. Then, ‗new‘ determinations can be produced with
extant data by introducing conditions such as: 1) assume none of the students scored proficient
on the AA-MAS; 2) assume the top 25% scored proficient; and 3) assume the top 50% scored
proficient etc. For example, in the third condition, all students scoring above the median in the
distribution of scores for the 2% of lowest performance students on the general assessment
would be designated as proficient on a hypothetical AA-MAS. Then, 2008 AYP determinations
would be calculated with this change and the results would be compared to the actual
outcomes. Of particular interest will be a review of results at system, school, and subgroup level
to gauge which areas are likely to have the most substantial impact. In the method described,
the performance categories can certainly be modified, but serve to illustrate the proposed
approach. This method, while not exact, can provide an indication of expected accountability
outcomes (if only ‗best‘ or ‗worst case‘ scenarios) to assist the NYSED in understanding and
preparing for fluctuations in accountability determinations.
When an AA-MAS is implemented, the NYSED should continue to carefully monitor the
consistency of determinations from year to year. Such monitoring at the district, school, and
subgroup level can illuminate components of the accountability system that are most volatile.
This may involve simply tracking changes in the PI for schools and subgroups and comparing
the numbers and percent of schools and groups that make AYP. For schools that do not make
AYP, it will be useful to track both the number and type of subgroups that missed the AMO, as
well as the margin by which AMO was not achieved.
As discussed in the previous section, confidence intervals are the primary mechanism
for dealing with sample variability in the accountability model. Because the introduction of an
AA-MAS can have an impact on the PI for all students and especially the SWD subgroup, the
NYSED may find it beneficial to evaluate the effective AMOs. One approach may be to model
Considerations for an AA-MAS
Page 355
results given different same size ranges. Currently, the ranges vary by 5 until a group size of 50
and then increase by units of 10 getting progressively larger. Because the confidence interval
stabilizes with large n sizes, it is unlikely that the upper range will be impacted. However, for
smaller n sizes, it may be useful to adjust the ranges (perhaps constricting them) and note
differences in classification outcomes for schools and subgroups.
The collection of multiple sources of qualitative and quantitative information will
strengthen overall findings. For example, if data exist related to outstanding professional
development or instructional programs, how do the schools and/or groups recognized for such
programs perform on the AA-MAS in particular and the accountability system in general?
Additionally, the NYSED may wish to be intentional about collecting data regarding the
opportunity to learn and student characteristics for the population taking the AA-MAS. This may
be accomplished through initiatives such as surveying teachers and school leaders on the
quality and consistency of instructional opportunities, student engagement, and other indicators
(e.g. class work) of student success. Some of these methods are discussed in further detail in
Chapter 3 (Karvonen, this volume). By comparing this information with AA-MAS results and
accountability determinations, additional evidence about the efficacy of the system may be
produced.
Finally, in analyzing findings, it is important to consider both Type I and Type II errors. A
Type I error may be said to occur when a school with strong, effective programs does not make
AYP and is determined to be in an improvement status. A Type II error describes the situation
where a school in need of improvement is erroneously classified as meeting standards. In
practice, an increase in Type II error may be the larger threat with the introduction of an AAMAS. Ideally, if fewer schools are classified as needing improvement, it will be due to more
appropriate assessments that accurately reflect a higher level of student achievement
previously masked by barriers on the general assessment. However, to the extent that the AA-
Considerations for an AA-MAS
Page 356
MAS is used to lower expectations, Type II error will be elevated and students in need of
support services may not be identified.
Reporting
Another important element of an assessment and accountability system is public
reporting. Decisions about the design and distribution of performance reports directly impact the
theory of action that can promote student and school improvement. Therefore, a plan for
effective assessment accountability reporting practices related to the AA-MAS is essential. In
general there are three major considerations with respect to reporting: 1) identify the information
that should be reported; 2) determine how the information should be presented; and 3) decide
how the information will be disseminated.
The United States Department of Education has explicitly defined the information that
must be reported in NCLB compliant accountability systems, which is currently incorporated in
NYSED‘s reporting system. Additional requirements from the 2007 regulations stipulate that
accountability determinations should include: 1) the number of students with disabilities
participating in the general assessments and the number provided accommodations; 2) the
number participating in the AA-AAS and the AA-MAS; and 3) performance results for students
taking each assessment.
The guiding principle for designing reports is to make the information accessible to
stakeholders such that it is actionable. In her 2002 CCSSO publication addressing
accountability reporting, Ellen Forte proposes the following criteria for effective reports:
Accessible to the target audiences, both physically and linguistically;
Accompanied by adequate interpretive information;
Supported by evidence that the indicators, other information, and suggested
interpretations are valid;
Coordinated with other reports within the reporting system:
Considerations for an AA-MAS
Page 357
o
Across paper and electronic versions of report cards, and
o
Across reports cards and assessment reports.
These criteria suggest that the reports should be designed such that they are technically
comprehensive, but simple to read and understand by all stakeholders—a nontrivial task.
However, there are a few approaches that may help accomplish this. For example, the NYSED
may consider including reader-friendly narratives that describe the knowledge and skills in each
performance level on student level reports and/or supporting documents. Moreover, presenting
key information in graphical format on both student and summary reports often improves the
readability and usefulness of reports. To the extent that it is practicable, reports should follow a
standard format across programs, which may reduce confusion for consumers of multiple
reports. Finally, reports and supporting documents are often reviewed by broad-based
committees to promote the likelihood that the information is presented appropriately.
Moreover, it will be important to support appropriate interpretation and use of the results
of the AA-MAS. The Standards for Educational and Psychological Testing (AERA, APA, &
NCME, 1999) address this principle explaining, ―interpretations should describe in simple
language what the test covers, what scores mean, the precision of the scores, common
misinterpretations of test scores, and how scores will be used‖ (p. 65). This is a vital component
of any testing program but is particularly important given the distinctive nature of an alternate
assessment and the students assessed. Examples of support initiatives might include
distributing an interpretative guide, developing online resources, and/or conducting training
workshops with educators.
Diploma Eligibility
One important policy issue for the NYSED is the impact of an AA-MAS on diploma
eligibility. Currently, students may exit high school with a local diploma, a Regents diploma, an
Advanced Regents diploma, or an IEP certificate. If the AA-MAS is developed for high school
Considerations for an AA-MAS
Page 358
content, a decision regarding whether or not students will be eligible for a Regents or local
diploma and, if so, what level performance is required, must be resolved.
USED regulations require states to ensure that students who take an AA-MAS are not
precluded from attempting to complete the requirements for a regular high school diploma. This
requirement does not compel the state of New York to treat AA-MAS scores as comparable to
those from general assessments with respect to diploma eligibility criteria. The regulation is
intended to prohibit tracking that might prevent a student from taking a path that leads to a
regular diploma. Stated another way, students cannot be denied the option to qualify for a
regular diploma (whatever those qualifications are) if they take an AA-MAS at any point.
Therefore, a number of possibilities can be considered to operationalize an AA-MAS in a
way that is consistent with federal requirements. One approach would be to establish a level of
performance on the AA-MAS that is regarded as an acceptable qualification for a Regents
diploma. The methods discussed in Chapters 7 and 8 (Perie and Abedi, this volume) could
inform the selection of a cut score that serves this purpose — as might be produced in a linking
study. The topic of determining performance standards and cut scores is further addressed in
Chapter 6 (Welch and Dunbar, this volume) and Chapter 7 (Perie, this volume).
Another approach would be to continue the policy of using performance on the Regents
Examination as the acceptable qualification for a Regents or local diploma. If this approach is
selected, the importance of developing clear guidelines and procedures for how students can
move from an AA-MAS to the general assessment is elevated. That is, students will need to be
clearly informed about the path and requirements necessary to qualify to take a Regents
Examination and all students should have an opportunity to pursue that path.
A third option might involve establishing multiple criteria for diploma eligibility for
students that take an AA-MAS. This rationale behind this option is that the AA-MAS alone may
not provide sufficient evidence that a student has achieved graduation requirements. However,
coupled with additional indicators, such a decision can be supported. Examples of indicators
Considerations for an AA-MAS
Page 359
that may provide such evidence might include: recommendation from the students IEP
committee, meeting identified course-taking or performance standards, achieving a requisite
score on another assessment (e.g. SAT or ACT), or meeting selected vocational or industry
certification credentials.
The decision regarding the role of the AA-MAS in diploma eligibility should be based on
the values and priorities of the NYSED and the characteristics of the AA-MAS. That is, as a
matter of policy the department determines the knowledge, skills, competencies etc. that are
required for each diploma type. Then, the extent to which the AA-MAS produces a measure that
satisfies these criteria will largely define how it will function with respect to diploma eligibility.
Finally, there are important legal considerations to attend to if a state changes diploma
eligibility requirements. As established in the landmark Debra P. v. Turlington (1981) the state
must provide adequate notice of any changes to assessment requirements related to diploma
eligibility and ensure there is a high degree of content validity. Moreover, it is advisable to
conduct research (e.g. broad distribution of a survey) to gauge the extent to which students
have an opportunity to learn the knowledge and skills covered on the assessment. Such
research might include a review of IEPs to ensure learning goals and supports are in line with
expectations of the AA-MAS.
Conclusion and Recommendations
The overarching theme of this chapter is that developing and implementing an AA-MAS
should not be regarded as an isolated enterprise. A full consideration of the issues and options,
should involve a review of many practical and policy issues related to the entire assessment and
accountability system.
This process begins with an examination of whether to implement an AA-MAS and, if so,
to what extent? As discussed, this question is largely informed by carefully studying the extent
to which the current assessment system is appropriate for students with disabilities. The ‗stakes‘
Considerations for an AA-MAS
Page 360
of the assessment should also be taken into consideration when considering the scale and/or
priorities for implementation. Finally, it is unavoidable that availability of resources will influence
the capacity to move forward.
Determining eligibility criteria is another key decision. Data sources and approaches to
inform this decision were explored in this chapter such as using assessment data to evaluate
the likelihood of reaching target performance on future administrations and analyzing the
characteristics of persistently low performers. Finally, setting and evaluating participation criteria
is bolstered when multiple, corroborating data sources are used.
This task is complicated by the need to disentangle low performance due to disability
from that which is due to lack of opportunity to learn. It remains critically important for states to
investigate strategies to support all learners by evaluating educational services. Moreover, it is
advisable to review the development process and policies related to general assessments to
maximize the likelihood that all students are afforded the opportunity to demonstrate what they
know and can do. This may include such practices as attention to universal design or a review
of accommodations options to ensure they are effective and appropriate.
It is also important to explore the impact of the AA-MAS on the state accountability
system. There are methods available to evaluate decision consistency, which is impacted by
two main sources: measurement error and sample error; the latter of these accounts for most of
the variability in accountability determinations. Accordingly, some approaches suggested by Hill
and DePascale (2002) and Arce-Ferrer, Frisbie, and Kolen (2002) were presented to evaluate
the impact of sample error.
The discussion of impact to accountability systems also included a review of operational
considerations. In this section, some features specific to the state of New York (e.g. effective
AMOs and the Performance Index) were discussed. Additionally, some approaches suggested
by Martinez and Olson (2004) to manage the redistribution of non-proficient scores were
presented. The author concludes that a method based on pre-determining thresholds for district
Considerations for an AA-MAS
Page 361
participation rates may be most promising, provided the state applies additional scrutiny to
explore and possibly adjust for defensible deviations from these values.
In the following section, a number of specific analyses to evaluate the impact to
accountability systems were suggested. Many of these approaches can be conducted annually
for ongoing system monitoring, which is certainly advisable. In the discussion, a method to
provide advance information about the impact of implementing an AA-MAS was proposed.
Because it is likely that fluctuations will be non-uniform, the primary benefit of this approach will
be to identify the areas that are likely to have the most substantial impact, which can help the
state prepare for implementation.
Certainly, the utility of assessment information is strongly tied to the quality of external
reports. For this reason, some succinct recommendations were presented to produce
accessible information on student and summary reports and produce well-designed support
materials. This may be best accomplished by having broad based groups assist with design or
review of materials. Moreover, maintaining some consistency of presentation on the reports will
increase the likelihood that the information provided on the reports will be meaningful to
stakeholders.
Finally, some considerations related to diploma eligibility policy were presented. The key
point is that policies should be established that provide a path for students who take an AA-MAS
to be eligible for a regular diploma. Such a policy may identify a specific performance level on
the AA-MAS or may involve alternate and/or multiple criteria to meet this standard. In any case,
the policy should be clearly articulated and in line with the state‘s values and priorities for high
school graduates.
Ultimately, the NYSED‘s objective is to ensure the continuance of a coherent, effective
assessment and accountability system. This is accomplished by careful planning and systematic
evaluation. By so doing, the state is able to design and operationalize a more suitable
Considerations for an AA-MAS
Page 362
assessment and accountability system, which best positions the state of New York, or any other
state, to promote student achievement.
Considerations for an AA-MAS
Page 363
References
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and
psychological testing. Washington, DC: American Educational Research Association.
Arce-Ferrer, A., Frisbie, D.A., & Kolen, M.J. (2002). Standard errors of proportions used in
reporting changes in school performance with achievement levels. Educational
Assessment, 8(1), 59-75.
Chapman v. California Department of Education, 229 F. Supp. 981 (N.D. Calif., 2002)
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA:
Wadsworth.
Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometricka,
16, 297-334.
Darling-Hammond, L. (2006). Standards, Assessments, and Educational Policy: In Pursuit of
Genuine Accountability. Eighth Annual William H. Angoff Memorial Lecture. Princeton,
NJ: Educational Testing Service.
Debra P. v. Turlington, 644 F2d 397 (5th Cir. 1981).
Fincher, M. 2007. ―Investigating the Academic Achievement of Persistently Low Performing
Students‖ in the session on Assessing (and Teaching) Students at Risk for Failure: A
Partnership for Success at the Council of Chief State School Officers Large Scale
Assessment Conference, Nashville TN, June 17-20, 2007. Available at:
http://www.ccsso.org/content/PDFs/12%2DMelissa%20Fincher%20Paul%20Ban%20Pa
m%20Rogers%20Rachel%20Quenemoen.pdf.
Forte Fast, E. (2002). A guide to effective accountability reporting. Council of Chief
State School Officers State Collaborative on Assessment and Student Standards
Accountability Systems and Reporting Consortium.
Fuhrman, S. & Elmore, R. (Eds.). (2004). Redesigning accountability systems for education.
New York: Teachers College Press.
Gong, B., & Marion, S. (2006). Dealing with flexibility in assessments for students with
significant cognitive disabilities. Dover, NH: National Center for the Improvement of
Educational Assessment.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp.
65-110). Westport, CT: American Council on Education/Praeger.
Hill, R.K., & DePascale, C.A. (2002). Determining the reliability of school scores. Portsmouth,
NH: The National Center for the Improvement of Educational Assessment Inc.
Lazarus, S. S., Thurlow, M. L., Christensen, L. L., & Cormier, D. (2007). States’ alternate
assessments based on modified achievement standards (AA-MAS) in 2007 (Synthesis
Considerations for an AA-MAS
Page 364
Report 67). Minneapolis, MN: University of Minnesota, National Center on Educational
Outcomes.
Marion, S., White, C., Carlson, D., Erpenbach, W, Rabinowitz, S., & Sheinker, J. (2002). Making
valid and reliable decisions in determining adequate yearly progress. Washington, DC:
Council of Chief State School Officers.
Martinez, T., & Olsen, K. (2004). Distribution of proficient scores that exceed the 1% cap: Four
possible approaches. Mid-South Regional Resource Center. Available at:
http://eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/1b/a3/1
f.pdf
No Child Left Behind Act of 2001, 20 U.S.C. § 6301
O‘Day, J., & Smith, M. (1993). Systemic school reform and educational opportunity. In S. H.
Fuhrman (Ed.), Designing coherent education policy: Improving the system. San
Francisco: Jossey-Bass.
Perie, M. (2007). Key elements for educational accountability models. Washington, DC:
Council of Chief State School Officers.
United States Department of Education. Modified-Academic Achievement Standards: NonRegulatory Guidance. April, 2007.
United States. Department of Education. National Commission on Excellence in Education. A
Nation at Risk: The Imperative for Educational Reform. Washington: GPO, 1983.
Considerations for an AA-MAS
Page 365
APPENDIX A: INDIVIDUALS INVOLVED IN THIS PROJECT
Project Coordinator:
Larry Hirsch, NYCC
Project Assistant:
Laticha Sotero, NYCC
Project Manager/Editor:
Marianne Perie, Center for Assessment
Expert Panel:
Jamal Abedi, University of California, Davis
Chris Domaleski, Center for Assessment
Steve Dunbar, University of Iowa
Howard Everson, Fordham University
Claudia Flowers, University of North Carolina, Charlotte
Brian Gong, Center for Assessment
Meagan Karvonen, Western Carolina University
Suzanne Lane, University of Pittsburgh
Scott Marion, Center for Assessment
Jim Pellegrino, University of Illinois, Chicago
David Pugalee, University of North Carolina, Charlotte
Rachel Quenemoen, National Center on Educational Outcomes
Robert Rickelman, University of North Carolina, Charlotte
Katherine Ryan, University of Illinois, Urbana Champagne
Gerald Tindal, University of Oregon
Cathy Welch, University of Iowa
Phoebe Winter, Pacific Metrics
New York State Department of Education:
David Abrams, Assistant Commissioner for Standards, Assessment, and Reporting
Candy Shyer, Bureau Chief of Test Development, Office of State Assessment
Rebecca Cort, Deputy Commissioner, Office of Vocational and Educational Services for
Students with Disabilities
Considerations for an AA-MAS
Page 366
APPENDIX B: LIST OF INTERNET RESOURCES FOR EFFECTIVE CURRICULUM AND
INSTRUCTION
General Resources
K-8 Access Center (http://www.k8accesscenter.org/index.php)
Federally-funded project that has ended, but the Web site still hosts publications on access to
curriculum in a variety of content areas as well as instructional issues.
National Center on Educational Outcomes (http://nceo.info)
―Provides national leadership in the participation of students with disabilities in national and
state assessments, standards-setting efforts, and graduation requirements‖
National Center on Accessing the General Curriculum (NCAC)
(http://www.cast.org/policy/ncac/index.html)
Web site ―provides a vision of how new curricula, teaching practices, and policies can be
combined to create practical approaches for improved access to the general curriculum by
students with disabilities‖
IEPs
NASDSE 2007 Standards-based IEP examples (available at projectforum.org)
Document that describes the 7-step process for creating standards-based IEPs, then applies
those steps to two students. Sample IEPs for those students are provided.
Assistive Technologies
National Public Website on Assistive Technology (http://www.assistivetech.net/)
Hosted by the Center for Assistive Technology and Environmental Access at Georgia Institute of
Technology. Provides access to the latest assistive technology.
ABLEDATA (http://www.abledata.com/)
NIDRR-sponsored project operated by ICF Macro. Provides a comprehensive database of
assistive technology and rehabilitation devices, as well as publications and external links related
to assistive technologies.
CBM, RtI, and Progress Monitoring
National Center on Student Progress Monitoring (http://www.studentprogress.org/)
Web site that provides descriptions of what progress monitoring is, benefits/challenges, and
provides a list of CBMs reviewed on site.
National Center on Response to Intervention (http://www.rti4success.org/)
Web site that provides information on what Response to Intervention is, how the tiered-system
works, how RtI can be used with different populations, and resources such as explaining the
difference between curriculum-based measurement and curriculum-based assessment.
Easycbm.com (http://www.easycbm.com)
A Web site that provides free membership to have access to curriculum-based measurements,
reports, and charts to track student progress.
Considerations for an AA-MAS
Page 367
RtI Resources (http://www.jimwrightonline.com/php/rti/rti_wire.php)
Web site containing information on RtI, how to choose interventions, how to use problemsolving teams, how to monitor student progress, and it provides graphs to monitor.
RtI Network (http://www.RTInetwork.org)
Web site providing information on RtI; how to develop and implement an RtI plan; and breaks
down resources into Pre-K, K-5, middle school, high school, and parents/families.
Interventions and CBM (http://www.interventioncentral.com)
Web site providing information on interventions, progress monitoring, curriculum-based
measurements, graphing data, and RtI.
Progress Monitoring (https://dibels.uoregon.edu/faq.php#faq_dib3)
Web site that provides free membership and provides information on progress monitoring, as
well as access to probes for curriculum-based measurement.
Considerations for an AA-MAS
Page 368
APPENDIX C: TOOL FOR STATE POLICYMAKERS
This tool is simply a list of guiding questions for state policymakers considering the
development of an AA-MAS. Beyond providing information to think about in deciding whether an
AA-MAS fits well into a state‘s current system, it also provides guidance during the design and
development process. Each question is linked back to a section of the report for further
information about topical considerations.
Relevant
Chapter(s)/Section(s)
Topic
Guiding Question(s)
Appropriateness of Developing
an AA-MAS
Are there students who cannot be
appropriately assessed with the
state‘s current large-scale
assessment system?
Chapter 2, pages 23–39
Chapter 4, pages 103–105
Chapter 10, pages 345–347
How do we know that the problem
is the format or design of the
general assessment rather than a
lack of opportunity to learn the
material on the assessment?
Chapter 2, pages 32–39
Chapter 3, pages 52, 60
Will this state reap more benefits
from developing a new assessment
targeted towards eligible students
rather than focusing on their
instruction through another means,
such as professional development
of teachers?
Chapter 3, pages 78–79
Chapter 5, pages 163–167
What is your theory of action for
how this assessment will improve
student outcomes?
Chapter 9, pages 317–323
How do we identify the students
who are eligible to take the AAMAS?
Chapter 2, pages 30–39
Chapter 10, pages 343–347
Identifying the Target Population
What are the characteristics of
these students? (This question may
need to be answered for several
different groups of students.)
Considerations for an AA-MAS
Chapter 2, pages 23–39
Chapter 10, pages 345; 356;
358
What are your assumptions about
these students‘ ability to learn
grade-level content and to show
what they know?
Chapter 4, pages 103–105;
125–134
Are these students different from
students without disabilities who
have performed poorly on the largescale assessment? If so, how?
Chapter 2, pages 23–39
Chapter 4, pages 103–105
Page 369
Appropriateness of current
curriculum and Instruction
Appropriateness of AA-MAS for
improving student outcomes
Considerations for an AA-MAS
Do these students have standardsbased IEPs that promote an
opportunity to learn the standardsbased curriculum? How do you
know?
Chapter 3, pages 57–59; 61–
63; 69–74
What evidence exists to support a
determination that these students
will not achieve grade level
proficiency in the current year
because of the effects of their
disability and not because of lack of
opportunity-to-learn?
Chapter 2, pages 32–39
Chapter 3, pages 57–63
Chapter 9, pages 325–326
What evidence exists to support the
policy assumptions that these
students are provided high quality
access to the standards-based
curriculum, through specialized
instruction, services, and support?
Chapter 3, pages 59–69
Chapter 5, page 163
What types of training and support
are available for teachers of these
students to improve participation
and performance in the standardsbased curriculum?
Chapter 3, pages 78–79
Chapter 5, pages 164–167
What training, oversight, and
monitoring processes are built into
the system to ensure that IEP
teams make high quality decisions
about who participates in AA-MAS?
Chapter 3, pages 78–81
What is the nature of the barriers to
these students‘ participation on the
general assessment?
Chapter 4, pages 103–105;
122–124; 135; 140
Chapter 5, pages 163–167
Chapter 6, pages 210–215
How will this assessment provide a
more accurate measure of the
knowledge and skills of the
participants compared with the
general assessment?
Chapter 6, pages 224–232
How will development of an AAMAS yield more valid inferences
about the students than other
assessment approaches, such as
improved general assessment
design, appropriate
accommodations, or development
of an AA-GLAS?
Chapter 2, pages 27; 45–48
Chapter 3, pages 76–78; 82
Chapter 5, page 176
Chapter 6, pages 214–215
Chapter 8, pages 293–296;
300
What are the relative costs and
benefits of assessment
development and implementation
compared with other uses of
resources, such as targeted staff
Chapter 6, page 197
Chapter 9, page 309
Chapter 10, page 342
Page 370
development on instructional and
curricular interventions for teachers
of struggling learners?
Modified achievement standards
Test Design
Considerations for an AA-MAS
How will the inclusion of the AAMAS as part of the state‘s
assessment system lead to better
instructional and curricular
opportunities for these participating
students?
Chapter 3, pages 59–69
How do the performance
expectations of the AA-MAS relate
to those in the general assessment
and the AA-AAS? Is Proficient on
the AA-MAS similar in nature to
Proficient on the general
assessment? Is it closer to Basic?
Or is it somewhere in between?
Chapter 7, pages 240–251
Chapter 8, pages 272–273;
276
Is there an expectation that the AAMAS may provide a stepping stone
for students to reach Proficient on
the general assessment? Or, is the
expectation that students taking the
AA-MAS are a unique population
that will always need the
modifications provided? Is a
student who scores Advanced on
the AA-MAS prepared to take the
general assessment or an AAGLAS or are they simply exceeding
the criterion on their own
assessment?
Chapter 2, pages 30–39, 45–
47
Chapter 7, pages 242–243
How will you carry your philosophy
regarding the description of the
students and their barriers to
participation in a general
assessment to your design of the
AA-MAS?
Chapter 5, pages 186–192
Chapter 6, pages 208–224
Chapter 7, pages 240–246
What type of assessment best fits
your philosophy—a modification of
your general assessment? AAGLAS or a modification of the AAGLAS?
Chapter 2, pages 45–48
Chapter 6, page 252
If you choose to modify your
general assessment, which types of
modifications best match your
philosophy regarding the students‘
barriers to participation in the
general assessment?
Chapter 4, pages 140–142
Chapter 5, pages 186–192
Chapter 6, pages 208–224
Page 371
How do you intend to maintain the
depth and breadth of the
assessment but reduce the
difficulty?
Chapter 4, page 135
Chapter 5, pages 161–162;
180–192
Chapter 6, pages 206–222
How will you measure and
demonstrate the degree of
comparability between the AA-MAS
and the general assessment?
Chapter 8
Documenting the technical
quality or validating the AA-MAS
What are the important features of
technical quality and validity that
should be evaluated and
documented throughout this
process?
Chapter 9
Incorporating an AA-MAS into an
existing assessment and
accountability system
In which subjects/grades should we
develop an AA-MAS?
Chapter 10, pages 340–342
How does the AA-MAS fit between
the AA-AAS and the general
assessment? Do we expect to see
smooth transitions from one
assessment to the next?
Chapter 2, pages 48–49
Chapter 3, pages 52–53, 64–
65, 85–86
Chapter 7, pages 246–247
Chapter 10, page 346
How will you report results in a
manner that will provide maximum
information to teachers and
parents?
Chapter 4, page 138
Chapter 10, pages 359–360
Considerations for an AA-MAS
Page 372
GLOSSARY
Achievement standard: A definition of a level of performance including both a minimum cut
score and a written description that distinguishes the level of performance from other defined
levels.
Accommodation: Changes in the administration of an assessment, such as setting,
scheduling, timing, presentation format, response mode, or others, to provide better access to
the assessment in a manner that does not change the construct intended to be measured by the
assessment or the meaning of the resulting scores.
Accountability: The systematic use of assessment data and other information to evaluate the
effectiveness of a program, such as an education system, for the purpose of rewarding desired
outcomes and sanctioning undesirable outcomes.
Adaptation: A generalized term that describes a change made in the presentation, setting,
response, or timing or scheduling of an assessment that may or may not change the construct
of the assessment.
Adequate Yearly Progress (AYP): Under the No Child Left Behind Act, the minimum level of
performance that states, school districts, and schools must demonstrate each year as measured
by the proportion of students classified as Proficient or better to reach 100% Proficiency by
2014.
Alternate achievement standards: Cut scores and performance-level descriptors
differentiating achievement on tests of content linked to grade level curriculum appropriate for
students with the most significant cognitive disabilities.
Alternate assessment: An instrument used in gathering information on the performance and
progress of students whose disabilities preclude them from valid and reliable participation in the
general state assessment. Alternate assessments may be developed to measure alternate
achievement standards, modified achievement standards, or grade-level achievement
standards.
Annual Measurable Objective (AMO): a set of federally-required state-established
benchmarks serving as targets for performance among and across student subgroups, schools,
and districts.
Assessment: Any systematic method of obtaining evidence to draw inferences about people or
programs. Assessment may include both formal methods, such as large-scale state
assessments, or less formal classroom-based procedures, such as quizzes, class projects, and
teacher questioning.
Bias. In a statistical context, a systematic error in a test score. In discussing test fairness, bias
may refer to construct under-representation or construct-irrelevant components of test scores
that differentially affect the performance of different groups of test takers.
Classification Errors: (aka, Type I/Type II errors). Errors made when the application of a cut
score or other determinant results in ―failing‖ a student/school/district when they should have
passed (Type I error) or ―passing‖ someone who should have failed (Type II error).
Considerations for an AA-MAS
Page 373
Cognition: How students represent knowledge and develop competence in a subject domain.
Cognitive architecture: The information processing system that determines the flow of
information and how it is acquired, stored, represented, revised, and accessed in the mind.
Cognitive complexity: An individual psychological characteristic related to the type of thinking
a student would need to do in order to correctly answer an item or task, including the number of
mental structures a student would have to use, how abstract the item structures were, and how
elaborately the structures interacted with each other.
Comparability: The degree to which similar inferences can be made from the outcomes of two
or more assessments.
Construct: As applied to assessment, the complete set of knowledge, skills, abilities, or traits
representing a particular domain of knowledge, such as American history, reading
comprehension, study skills, writing ability, logical reasoning, honesty, intelligence, and so forth.
Content domain: The set of behaviors, knowledge, and skills to be measured by a test,
represented in a detailed specification and often organized into categories by which items are
classified.
Content standards: Statements of the knowledge and skills that students are expected to
learn. Content standards should drive instruction and test construction.
Curriculum: The knowledge and skills in subject matter areas that teachers are supposed to
teach and students are supposed to learn including a scope or breadth of content in a given
subject area and a sequence for learning.
Curriculum-Based Measurement (CBM): A method teachers use to determine how students
are progressing in basic academic areas such as math, reading, writing, and spelling by testing
students weekly using a short measure that is then graphed and analyzed to see if the progress
is sufficient to meet the target.
Cut score: A point on a score scale at or above which test takers are classified in one way and
below which they are classified in a different way. For example, if a cut score is set at 60, then
people who score 60 and above may be classified as ―passing‖ and people who score 59 and
below classified as ―failing.‖
Decision consistency: A measure of the reliability of the classification decision. Decision
consistency estimates the extent to which, if an examinee were administered a test on two
separate occasions, the same classification decision (whether pass or fail) would be made.
Declarative knowledge: Information about ―the way the world is.‖
Depth-of-knowledge: Degree of depth or performance complexity required to
understand/perform academic content/process found in content standards or assessment items;
a description of different ways students interact with content measured by how deeply students
must understand the content in order to respond.
Considerations for an AA-MAS
Page 374
Difficulty: In assessment, the proportion of respondents answering the item correctly.
Conceptually, it is based on underlying knowledge and cognitive processes required to answer
an item correctly.
Differential Item Functioning (DIF): A statistical property of a test items in which different
groups of test takers who have the same total test score have different performance on
particular items.
Disability Category: Assignments that qualify a child for special education and related
services; different from a medical diagnosis. Federal law (IDEA 2004, Part B) has 13 disability
categories that States must use to determine if students, ages 3-21, are eligible to receive
special education and related services: Autism, Deaf-blindness, Deafness, Emotional
disturbance, Hearing impairment, Mental retardation, Orthopedic impairment, Other health
impairment, Specific learning disability, Speech or language impairment, Traumatic brain injury,
Visual impairment including blindness, or Multiple disabilities.
Distractor: An incorrect option presented to an examinee in a multiple choice item.
Domain Sampling. The process of selecting test items to represent a specified universe of
performance.
Dynamic Evaluation: As used in the context of validity, dynamic evaluation refers to the notion
that evaluative judgments will be updated as new information about the assessment system is
presented. In other words, dynamic evaluation refers to the idea that the evaluation continues to
move (or adjust) as new information is gathered.
General Assessment: Assessments given to the majority of students at each grade level such
as the state end of year tests.
Grade-level achievement standard: A minimum cut score and written description that provide
an expectation for a level of performance aligned to the grade level in which a student is enrolled
or that matches his biological age.
Guiding Philosophy: The fundamental beliefs or set of assumptions that guide the conception,
development, implementation, and continuous improvement of an approach, program, practice,
or policy.
Individualized Education Program (IEP): A written plan and legal document designed to meet
the unique educational needs of one child, as defined by federal regulations under the
Individuals with Disabilities Education Act (IDEA). An IEP describes a child‘s present level of
functioning; specific areas that need special services; annual goals; short-term objectives;
services to be provided; and the method of evaluation to be implemented for children 3 to 21
years of age who have been determined eligible for special education.
Instruction: The methods of teaching and the learning activities used to help students master
the content and objectives specified by a curriculum and encompasses the activities of both
teachers and students.
Interim flexibility (aka 2% proxy): The practice of allowing states that meet certain criteria to
count as proficient for purposes of AYP a portion of the students with disabilities; the portion is
determined by dividing 2 percent by the percent of students with disabilities in the state.
Considerations for an AA-MAS
Page 375
Interpretative argument: A plan specifying the proposed interpretations and uses of test
results by laying out the network of inferences and assumptions leading to the observed
performances to the conclusions and decisions based on the performances.
Item format: The variety of test item structures or types that can be used to measure
examinees' knowledge, skills, and abilities, typically including multiple-choice or selectedresponse, open-ended or constructed-response, essay, or performance task.
Learning progression: Description of successively more sophisticated ways of reasoning
within a content domain that follow one another as students learn.
Measurement error: The differences between observed scores and the theoretical true score;
the amount of uncertainty in reporting scores; the degree of inherent imprecision based on test
content, administration, scoring, or examinee conditions within the measurement process that
produce errors in the interpretation of student achievement.
Metacognition: The set of skills and processes that allow one to reflect on, monitor, adjust and
direct one‘s own thinking and learning.
Modified achievement standard: A minimum cut score and written description that provide an
expectation for a level of performance aligned to grade level content standards but less rigorous
than a grade-level achievement standard.
Modification: Changes made in both instructional and assessment situations that are
individualized to student needs. In the context of assessment, changes are made to the content,
format, and/or administrative procedures of a test in order to accommodate test takers who are
unable to take the original test under standard test conditions. Unlike accommodations,
modifications may directly or indirectly compromise the validity of the content standard by
changing the construct. Modifications include a much wider range of supports and instructional
scaffolding than do accommodations but can be effectively used in combination with
accommodations in instructional and assessment situations when individualized to the student's
strengths and needs. Modifications are intended to allow for meaningful participation and
enhanced learning.
No Child Left Behind (NCLB): The 2001 reauthorization of the Elementary and Secondary
Education Act that added new requirements for annual student testing and annual measurable
objectives with a focus on improving achievement of all students and reducing the achievement
gap.
Opportunity to learn: The provision of learning conditions, such as curriculum, courses, and
instruction, including suitable adjustments, to maximize a student‘s chances of attaining the
desired learning outcomes, such as the mastery of content standards.
Parallel Forms: Two or more assessments that provide similar outcomes (true scores) of the
construct being measured.
Performance Index: A measure that weights scores at each performance level and awards a
school partial credit for students whose achievement improves, even though they may not yet
be proficient, and can be included in determining the adequate yearly progress (AYP) of the
school.
Considerations for an AA-MAS
Page 376
Portfolio (assessment): An assessment comprising the collection and analysis of examinee
work samples, typically consisting of performance tasks gathered over a specific period of time;
often used to assess special populations who have difficulty with standard paper-and-pencil
assessments.
Procedural knowledge: Information about ―how things are done.‖
Progress Monitoring: The process of collecting and evaluating data to make decisions about
the adequacy of student progress toward a goal by evaluating the student‘s actual rate of
change compared to the expected rate of change.
Prompt. Any form of verbal, nonverbal, or physical cue to structure, pace, or signal a response
to be made by the student. Examples include verbalisms like, ‗continue,‘ next,‘ now what,‘ or
reminders of each step; physical guidance is an example of a prompt.
Reliability: The characteristic of test scores of being dependable, generally conceptualized as
stability or consistency over both time and items.
Response to Intervention (RTI): A comprehensive, multi-step process that closely monitors
how the student is responding to different types of services and instruction.
Sampling Error: The error associated with observations from a sample instead of the whole
population, used to quantify the expected range within which the true population value might be
located relative to the sample data.
Scaffolding: An approach to enhancing items derived from supports provided during learning
that are gradually removed when learning becomes solidified and/or the learner becomes more
independent. Includes any type of structural assistance introduced to organize information or
guide responses embedded in the presentation of the item or task. These supports are not
intended to change the construct being measured.
Standard setting: An activity in which a procedure is applied systematically to gather and
analyze human judgment for the purpose of deriving one of more cut scores for a test.
Standards-based IEP: An individualized education plan that specifically refers to instruction of
the state‘s academic standards for the student‘s enrolled grade and focuses on aligning
instruction of students with disabilities to the academic content that all students at that grade
level should know and be able to do.
Student with disabilities (SWD): In the Individuals with Disabilities Act, a student with
disabilities is defined as ―a child evaluated in accordance with §§300.530-300.536 as having
mental retardation, a hearing impairment including deafness, a speech or language impairment,
a visual impairment including blindness, serious emotional disturbance (hereafter referred to as
emotional disturbance), an orthopedic impairment, autism, traumatic brain injury, another health
impairment, a specific learning disability, deaf-blindness, or multiple disabilities, and who, by
reason thereof, needs special education and related services.‖
Test domain: The portion of all knowledge and skill in a subject matter area that is selected to
be assessed because there is consensus that it represents what is important for teachers to
teach and for students to learn.
Considerations for an AA-MAS
Page 377
Test specifications: A detailed description for a test that specifies the number or proportion of
items that assess each content and process/skill area. Aka, test blueprint.
Theory of Action: Originally drawn from sociology and organizational studies, theory of action
is used in the education context to refer to higher level view of the interpretative argument.
Essentially, it provides an overview of how the specific components of the testing/educational
system are intended to work in concert to bring about the desired aims.
Universal design: The creation of products and environments meant to be usable by all
people, to the greatest extent possible, without the need for adaptation or specialization.
Universal design for learning: A framework for designing educational environments that
enables all learners to gain knowledge, skills, and enthusiasm for learning, by simultaneously
reducing barriers to the curriculum and providing rich supports for learning.
Validity: The extent to which inferences and actions made on the basis of a set of scores are
appropriate and justified by evidence. It is the most important aspect of the quality of a test.
Validity refers to how the scores are used rather than to the test itself.
Validity argument: An evaluation of the completeness and coherence of proposed
interpretations and uses of test results, based on both empirical evidence and logic, as specified
by the interpretative argument.
Validity evaluation: The full set of activities related to evaluating the proposed interpretations
and uses of test results and includes the interpretative and validity arguments as well as the
validity studies plan and the actual studies themselves.
Working memory: A kind of cognitive energy level or ―resource‖ that exists in limited amounts,
with substantial individual variations.
Considerations for an AA-MAS
Page 378