Large Scale Assessments and High Stakes Decisions: Facts, Cautions and Guidelines
National Association of School Psychologists
A basic premise of standards-based reform is that all children can learn. Although some students may require more time or varied instruction, standards-based reform articulates the expectation that all students will be provided the opportunity to meet a common set of instructional goals. Shaped by legislation and challenged by research, large-scale assessment programs have generated considerable controversy and inconsistency as states and districts attempt to measure student attainment of high standards. The purpose of this document is to highlight the factors influencing large-scale assessment, summarize cautions in implementing 'high stakes' testing programs and offer some basic guidelines to policymakers and administrators.
In the 1994 reauthorization of the Elementary and Secondary Education Act (ESEA), states were required to set challenging standards for student achievement, and to develop and administer assessments to measure student progress toward those standards. According to the National Research Council's 1999 guide ('Testing, Teaching, and Learning'), the intended outcome of these requirements is higher student achievement. Mandated core components of 'standards-based reform' include a) content and performance standards set for all students; b) development of tools to measure the progress of all students toward the standards; and c) accountability systems that require continuous improvement of student achievement. The 2002 reauthorization of ESEA, the No Child Left Behind Act, raises the stakes for states by requiring annual assessments in grades 3 through 8. It further requires states and schools to demonstrate 'adequate yearly progress' by increasing test scores. Schools that fail to meet these goals will face a series of graduated sanctions.
Students with disabilities are specifically included in the definition of 'all' students in ESEA. Additionally, the Individuals with Disabilities Education Act (IDEA) requires states to include children with disabilities in general state and district-wide assessment programs, with appropriate accommodations where necessary, and to report annually on the participation rates, performance, and progress of students with disabilities. When students with disabilities cannot participate in testing, even with accommodations, states are required to include these students through alternate assessments. Estimates of the prevalence of severe disabilities indicate that only 1-2% of all students will need to take alternate assessments.
During the past 30 years, Congress also enacted civil rights statutes to ensure equal access to a quality education for many targeted populations of students, such as students of color, economically disadvantaged students, students with disabilities, students with limited English proficiency and females (e.g., Civil Rights Act of 1964; Americans with Disabilities Act; Section 504 of the Rehabilitation Act of 1973). The focus of these statutes is on the right of all students to have an equal opportunity to achieve high academic standards as measured by appropriate assessment processes. Fairness or integrity of the assessment process is indicated by careful alignment of standards, curriculum and instruction, assessment, and opportunity to learn.
Students with disabilities, students from disadvantaged backgrounds, and students who do not speak English as a first language have struggled to overcome low educational expectations for some time. Federal laws such as ESEA and IDEA can be seen as legislated attempts to 'raise the bar.' Research, such as that conducted by the National Center on Educational Outcomes, has documented a number of positive consequences of including students with disabilities in assessment programs, such as increased levels of performance, higher expectations for student achievement, increased access to the general education curriculum, and improved teaching and instruction. However, as standards-based reform is implemented, concerns have been raised that, despite good intentions, the potential exists for unintended negative outcomes.
Concerns and Cautions Regarding Large-Scale Assessment
Recognizing multiple purposes of large-scale assessment. Nearly all states have established large-scale assessment programs to measure student progress toward standards. However, multiple stakeholders want assessments to meet a variety of needs: educators want test results to inform instruction; taxpayers want to know that the money they spend translates into student learning; governors want assurances that their students are achieving at a level similar to or better than students in other states. Yet tests should be designed for the specific purpose that they are intended to serve and for the population that they will measure. While some states have tried to meet the demands for accountability by modifying existing large-scale assessments or developing new tests, many other states continue to use single tests for multiple purposes of system accountability, school improvement, and measurement of individual student or group performance, regardless of their intended use and inherent limitations. As states rush to meet new mandates, this inappropriate practice is likely to increase.
High stakes and negative consequences. Tests are considered high stakes for students when the results are used to make critical decisions about the individual's access to educational opportunity, grade-level retention or promotion, graduation from high school, or receipt of a standard or alternative diploma. These kinds of decisions all have immediate as well as long-range impact on the student. In some states, high stakes also are attached to test results for school systems--teachers, administrators and schools are rewarded or sanctioned based on student performance. When such high stakes are attached to assessment scores, there is greater potential for manipulation of data and negative consequences:
1) Use of a single test score in making promotion/retention decisions. Test development experts agree that it is not appropriate to use performance on a single standardized test for making high-stakes decisions for individuals. Yet, increasingly, states are requiring schools and school districts to use state test scores to determine whether students should be promoted to the next grade level, resulting in higher numbers of retained students each year. Extensive research over many years indicates that repeating a grade does not usually improve student achievement and further demonstrates a strong relationship between retention and increased dropout rates.
2) Use of a single test score in graduation decisions. Some states have adopted exit exams for high school graduation, resulting in the denial of a diploma to thousands of students based on a single, standardized test, without regard to their classroom performance, teachers' recommendations, or access to adequate classroom resources, quality instruction, or pupil services support. Although states may allow students to take these tests several times, multiple administrations of the same type of measure do not improve the reliability of the scores or reduce the general limitations of such testing.
3) Use of test performance as a basis for systems-level rewards and sanctions. There is strong political support for the use of assessment results for system accountability, as reflected in the new provisions of ESEA. Administrators and teachers are rewarded or sanctioned based on student test performance. In some schools, these consequences negatively affect instruction for all students, including students with disabilities, by dramatically narrowing the curriculum and encouraging the use of generally inappropriate 'quick fix' approaches to student learning.
4) Impact on mainstream education. Large-scale testing programs can also have unintended but negative effects on the education provided to all students by unduly emphasizing basic skills to the exclusion of the arts, sciences and humanities; creating a culture of "teach-to-the-test"; increasing the psychological stress on children and families; and decreasing teacher job satisfaction. Further, schools may focus limited resources on efforts they believe will directly improve test scores, rather than on strategies to improve school climate and student learning.
Interpreting Results from Large Scale Assessments: Cautions and Considerations
Districts and states need to take great care when applying results of large-scale assessments to high stakes decisions such as graduation, retention, merit pay, etc. Factors that influence the accurate interpretation of standards test results include the following:
Who is assessed? There may be inconsistency in the groups of students included in the state assessment reports over time. For example, when students are retained or drop out, the group of students included in testing changes. Further, some states and districts continue to (illegally) exclude some students with disabilities and/or limited English proficiency from their assessment systems. New mandates and funding incentives may further pressure states to exclude groups of students who might tend to score below standards or require extensive accommodations.
Additionally, due to high student mobility in some areas, the group of students tested may vary significantly from one year to the next. In some schools 30% or more of the students turn over annually. Therefore many students tested in one school in a given year may have received much of their instruction elsewhere. Measuring effectiveness of instruction across schools or over time is severely compromised with highly mobile populations.
What tests are used and what do they measure? Assessment programs vary in many ways across states. Some states use assessments to compare individual student performance to a national group, while others compare individual student performance to established performance standards. Further, states differ in the content measured and how proficiency is defined and demonstrated. For example, some states may use 'minimum standards' while others use 'high standards.' Although trends within states are more reliable for comparison than cross-state trends, even comparisons within a given state must be reported carefully to assure similar data and standards are used. Additionally, it is essential that parents and the community understand what skills are addressed by testing programs. While academic skills (reading, math, writing) may be the presumed content measured, states' and districts' assessments often include 'critical thinking skills' components. Because these components are more related to aptitude or ability than to attainment of academic standards, inclusion of such measures may lead to misinterpretation or inappropriate use of test results. Where such tests are used, additional negative outcomes might include ability grouping and fixed expectations.
What accommodations were provided? States have different rules about the kinds of accommodations that can and cannot be used for students with disabilities and students with limited English proficiency. It is important to know not only that students were given appropriate accommodations, but also the kinds of accommodations given and how reliably these accommodations were provided in order to make accurate interpretations of results.
Recommended Guidelines for Large Scale Assessments
High stakes testing for individual students. Performance on a standardized test (or on multiple administrations of the same test) should not be the sole determinant in any 'either/or' decision about instructional placement, promotion, or graduation. Rather, results should be used as indicators of the need for early intervention, programmatic changes, or more specific evaluation of learning problems. Multiple measures of academic achievement, as well as teacher and family input, must be utilized in making such important decisions.
Test design and selection. Tests must meet professional standards for technical adequacy, must be reliable and valid for the purpose for which they are being used, and must be designed to measure progress towards standards. Further, tests must be appropriately aligned with standards, curriculum, and instruction, and be administered on a timeline that allows for adequate instruction. Tests should be critically reviewed to determine whether they are appropriate and valid for the widest range of children and youth, including students with disabilities and students with limited English proficiency. Critically, tests used for making decisions about individuals must be more reliable than those used for comparing groups. Assessments that are universally designed can significantly reduce the need for accommodations and increase comparability of scores. School districts and state policymakers should consult assessment experts such as school psychologists to ensure that their testing programs properly address technical issues tied to test construction and selection. States must distribute information about the amount of error in the test scores and caution educators and parents about the limitations of tests.
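As one illustration of what reporting 'the amount of error in the test scores' can look like, the minimal sketch below uses the standard error of measurement from classical test theory, SEM = SD x sqrt(1 - reliability). It is not drawn from any state's testing program: the function names, the scale standard deviation of 15, the reliability of 0.90, and the cut score of 100 are all hypothetical, and the 95% band assumes normally distributed, independent measurement errors.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if score_sd < 0:
        raise ValueError("score_sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, score_sd: float, reliability: float,
               z: float = 1.96) -> tuple[float, float]:
    """Approximate band around one observed score (z = 1.96 is roughly 95%)."""
    margin = z * standard_error_of_measurement(score_sd, reliability)
    return observed - margin, observed + margin


def band_includes_cut(observed: float, cut_score: float, score_sd: float,
                      reliability: float, z: float = 1.96) -> bool:
    """True when the cut score falls inside the band, so that a pass/fail
    call based on this single score could plausibly be measurement error."""
    low, high = score_band(observed, score_sd, reliability, z)
    return low <= cut_score <= high


def group_mean_error(score_sd: float, reliability: float, n_students: int) -> float:
    """Measurement-error component of a group mean: SEM / sqrt(n).
    Assumes errors are independent across students."""
    if n_students < 1:
        raise ValueError("n_students must be at least 1")
    return standard_error_of_measurement(score_sd, reliability) / math.sqrt(n_students)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    sd, rel, cut = 15.0, 0.90, 100.0
    print("SEM for one student:", round(standard_error_of_measurement(sd, rel), 2))
    print("Band for a score of 96:", score_band(96.0, sd, rel))
    print("Cut score inside band:", band_includes_cut(96.0, cut, sd, rel))
    print("Error in a mean of 100 students:", round(group_mean_error(sd, rel, 100), 2))
```

Under these assumptions, a student scoring a few points below the cut has a band that spans the cut score, while the measurement error in a mean over many students is far smaller. This is one way to see why a test that is adequate for comparing groups may still be too imprecise for decisions about individuals.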
Including all students. To support the necessary inclusion of all students in standards-based assessment programs, schools must appropriately implement accommodations, modifications, or alternate assessments when necessary. In addition, data on all students' performance should be included in all reports, clearly identifying which students are included in each data set. Educators must exercise caution in interpreting the results of large-scale assessments for all individuals and groups of students, particularly those with disabilities or limited English proficiency, as these tests may not adequately reflect the content or level of their instruction or address realistic instructional goals.
Training. Research (e.g., National Center on Educational Outcomes) indicates that the rapid pace of implementing inclusive large-scale assessments has not allowed sufficient opportunities for training. Ongoing staff development and training opportunities for educators are critical as this reform initiative moves forward.
Evaluation and research. All standards testing programs must have a systematic evaluation plan to address appropriate selection and implementation of procedures as well as student and system outcomes. Evaluation must consider the match between the assessment's purpose and its design; differences in performances across groups of students and possible sources of bias; the degree to which all students are included; compliance with intended accommodations, modifications, and alternative procedures; and the intended and unintended consequences of the testing program for individual students, staff, schools, districts, and states. Ongoing research is essential to address many unanswered questions about large-scale assessment and to assure development and implementation of accurate, fair, and useful measures of student and system progress.
Funding. Large-scale assessment is a complex and costly endeavor when appropriately designed and implemented. However, such assessment is even more costly when inadequate or inappropriate procedures are used. Any mandated state-wide or district-wide assessment program must have sufficient funds and timelines to ensure that a high quality process is developed, implemented, maintained, and evaluated.
References
Heubert, J. P. & Hauser, R. M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press. Available: www.nap.edu/books/0309062802/html/index.html
National Association of School Psychologists (www.nasponline.org)
National Research Council. (1999). Testing, teaching, and learning: A guide for states and school districts. Washington, DC: Author.
U.S. Department of Education, Office for Civil Rights. (2000). The use of tests when making high stakes decisions for students: A resource guide for educators and policy makers. Washington, DC: Author.
Some of this material is excerpted or adapted from articles in the NASP Communiqué, authored by staff of the National Center on Educational Outcomes, and from 'Students with Disabilities in Standards-Based Reform' by Martha Thurlow, published by OSEP (2000). NASP acknowledges the significant contributions of Dr. Cammy Lehr (National Center on Educational Outcomes) to the development of this document.
© 2002, National Association of School Psychologists, 4340 East West Hwy #402; Bethesda, MD 20814