This is the accessible text file for GAO report number GAO-09-911 entitled 'No Child Left Behind Act: Enhancements in the Department of Education's Review Process Could Improve State Academic Assessments' which was released on September 24, 2009. This text file was formatted by the U.S. Government Accountability Office (GAO) to be accessible to users with visual impairments, as part of a longer term project to improve GAO products' accessibility. Every attempt has been made to maintain the structural and data integrity of the original printed product. Accessibility features, such as text descriptions of tables, consecutively numbered footnotes placed at the end of the file, and the text of agency comment letters, are provided but may not exactly duplicate the presentation or format of the printed version. The portable document format (PDF) file is an exact electronic replica of the printed version. We welcome your feedback. Please E-mail your comments regarding the contents or accessibility features of this document to Webmaster@gao.gov. This is a work of the U.S. government and is not subject to copyright protection in the United States. It may be reproduced and distributed in its entirety without further permission from GAO. Because this work may contain copyrighted images or other material, permission from the copyright holder may be necessary if you wish to reproduce this material separately. Report to the Chairman, Committee on Health, Education, Labor, and Pensions, U.S. Senate: United States Government Accountability Office: GAO: September 2009: No Child Left Behind Act: Enhancements in the Department of Education's Review Process Could Improve State Academic Assessments: GAO-09-911: GAO Highlights: Highlights of GAO-09-911, a report to the Chairman, Committee on Health, Education, Labor, and Pensions, U.S. Senate. Why GAO Did This Study: The No Child Left Behind Act of 2001 (NCLBA) requires states to develop high-quality academic assessments aligned with state academic standards. Education has provided states with about $400 million for NCLBA assessment implementation every year since 2002. GAO examined (1) changes in reported state expenditures on assessments, and how states have spent funds; (2) factors states have considered in making decisions about question (item) type and assessment content; (3) challenges states have faced in ensuring that their assessments are valid and reliable; and (4) the extent to which Education has supported state efforts to comply with assessment requirements. GAO surveyed state and District of Columbia assessment directors, analyzed Education and state documents, and interviewed assessment officials from Maryland, Rhode Island, South Dakota, and Texas and eight school districts in addition to assessment vendors and experts. What GAO Found: States reported their overall annual expenditures for assessments have increased since passage of the No Child Left Behind Act of 2001 (NCLBA), which amended the Elementary and Secondary Education Act of 1965 (ESEA), and assessment development was the largest expense for most states. Forty-eight of 49 states that responded to our survey said that annual expenditures for ESEA assessments have increased since NCLBA was enacted. Over half of the states reported that overall expenditures grew due to development of new assessments. Test and question—also referred to as item—development was most frequently reported by states to be the largest ESEA assessment expense, followed by scoring. 
State officials in selected states reported that alternate assessments for students with disabilities were more costly than general population assessments. In addition, 19 states reported that assessment budgets had been reduced by state fiscal cutbacks. Cost and time pressures have influenced state decisions about assessment item type—such as multiple choice or open/constructed response—and content. States most often chose multiple choice items because they can be scored inexpensively within tight time frames resulting from the NCLBA requirement to release results before the next school year. State officials also reported facing trade-offs between efforts to assess highly complex content and efforts to accommodate cost and time pressures. As an alternative to using mostly multiple choice, some states have developed practices, such as pooling resources from multiple states to take advantage of economies of scale, that let them reduce costs and use more open/constructed response items. Challenges facing states in their efforts to ensure valid and reliable assessments involved staff capacity, alternate assessments, and assessment security. State capacity to provide vendor oversight varied, both in terms of the number of state staff and their measurement-related expertise. Also, states have been challenged to ensure validity and reliability for alternate assessments. In addition, GAO identified several gaps in assessment security policies that could affect validity and reliability and that were not addressed in Education's review process for overseeing state assessments. An Education official said that assessment security was not a focus of its review. The review process was developed before recent efforts to identify assessment security best practices. Education has provided assistance to states, but issues remain with communication during the review process. Education provided assistance in a variety of ways, and states reported that they most often used written guidance and Education-sponsored meetings and found these helpful. However, Education's review process did not allow states to communicate with reviewers during the process to clarify issues, which led to miscommunication. In addition, state officials were in some cases unclear about what review issues they were required to address because Education did not identify for states why its decisions differed from the reviewers' written comments. What GAO Recommends: GAO recommends that Education (1) incorporate assessment security best practices into its peer review protocols, (2) improve communication during the review process, and (3) identify for states why its peer review decisions in some cases differed from peer reviewers' written comments. Regarding the first recommendation, Education indicated that it believes its current practices are sufficient; it agreed with GAO's other two recommendations. View [hyperlink, http://www.gao.gov/products/GAO-09-911] or key components. For more information, contact Cornelia Ashby at (202) 512-7215 or AshbyC@gao.gov. 
[End of section] Contents: Letter: Background: States Reported That Assessment Spending Has Increased Since NCLBA Was Enacted and Test Development Has Been the Largest Assessment Cost in Most States: States Have Considered Cost and Time in Making Decisions about Assessment Item Type and Content: States Faced Several Challenges in Their Efforts to Ensure Valid and Reliable ESEA Assessments, including Staff Capacity, Alternate Assessments, and Assessment Security: Education Has Provided Assistance to States, but the Peer Review Process Did Not Allow for Sufficient Communication: Conclusions: Recommendations for Executive Action: Agency Comments and Our Evaluation: Appendix I: Objectives, Scope, and Methodology: Appendix II: Student Population Assessed on ESEA Assessments in School Year 2007-08: Appendix III: Validity Requirements for Education's Peer Review: Appendix IV: Reliability Requirements for Education's Peer Review: Appendix V: Alignment Requirements for Education's Peer Review: Appendix VI: Item Types Used Most Frequently by States on General and Alternate Assessments: Appendix VII: Comments from the U.S. Department of Education: Appendix VIII: GAO Contact and Staff Acknowledgments: Table: Table 1: Illustration of Depth of Knowledge Levels: Figures: Figure 1: Examples of Item Types: Figure 2: State Expenditures for Assessment Vendors, 2007-08: Figure 3: ESEA Assessment Activities That Received the Largest Share of States' Total ESEA Assessment Costs, 2007-08: Figure 4: The Number of States Reporting Changes in Item Type Use on ESEA Assessments since 2002: Figure 5: Number of FTEs Dedicated to ESEA Assessments in States, 2007- 08: Abbreviations: ARRA: The American Recovery and Reinvestment Act of 2009: AYP: Adequate Yearly Progress: CCSSO: Council of Chief State School Officers: Education: U.S. Department of Education: ESEA: The Elementary and Secondary Education Act: FTE: full-time equivalent: LEP: Limited English Proficiency: NCLBA: The No Child Left Behind Act of 2001: NECAP: The New England Common Assessment Program: SFSF: State Fiscal Stabilization Fund: TAC: Technical Advisory Committee: [End of section] United States Government Accountability Office: Washington, DC 20548: September 24, 2009: The Honorable Tom Harkin: Chairman: Committee on Health, Education, Labor, and Pensions: United States Senate: Dear Mr. Chairman: The No Child Left Behind Act of 2001 (NCLBA), which amended the Elementary and Secondary Education Act of 1965 (ESEA), aims to improve student achievement, particularly among poor and minority students. To reach this goal, the law requires states to develop high-quality academic assessments aligned with challenging state academic standards that measure students' knowledge of reading/language arts, mathematics, and science. Student achievement as measured by these assessments is the basis for school accountability, including corrective actions such as removing principals or implementing new curricula. NCLBA required that states test all students in grades 3 through 8 annually in mathematics and reading/language arts and at least once in one of the high school grades by the 2005-06 school year. It also required that states test students in science at least once in elementary, middle, and high school by 2007-08. Education has provided states with about $400 million for ESEA assessment[Footnote 1] implementation every year since 2002. 
To ensure that assessments appropriately measure student achievement, the law requires that assessments be valid and reliable and that they measure higher-order thinking skills and understanding. The U.S. Department of Education's (Education) guidance defines valid assessments as those for which results accurately reflect students' knowledge in a subject, and it defines reliable assessments as those that produce similar results among students with similar levels of knowledge. The law also directs states to assess all students, including those with disabilities. For children with significant cognitive disabilities, Education has directed states to develop alternate assessments that measure achievement on alternate state standards designed for these children. States have primary responsibility for developing ESEA assessments and ensuring their technical quality, and they can work with private assessment vendors that provide a range of assessment services, such as question (item)[Footnote 2] development and scoring. Education provides technical assistance and oversees state implementation of ESEA assessment requirements through its standards and assessments peer review process. In Education's peer review process, a group of experts--reviewers--examines whether states are complying with ESEA assessment requirements, including requirements for validity and reliability, and whether assessments cover the full depth and breadth of academic standards. NCLBA increased the number of assessments that states are required to develop compared to prior years, and states have reported facing challenges in implementing these new assessments. Little is known about how federal, state, and local funds have been used for assessments, or how states make key decisions as they implement ESEA assessments, such as whether to use multiple choice or open/constructed response items. To shed light on these issues and to assist Congress in its next reauthorization of ESEA, the Chairman of the Senate Committee on Health, Education, Labor, and Pensions requested that GAO provide information on the quality and funding of student assessments. Specifically, you asked GAO to examine the following questions: (1) How have state expenditures on ESEA assessments changed since NCLBA was enacted in 2002, and how have states spent funds? (2) What factors have states considered in making decisions about item type and content of their ESEA assessments? (3) What challenges, if any, have states faced in ensuring the validity and reliability of their ESEA assessments? (4) To what extent has Education supported state efforts to comply with ESEA assessment requirements? To conduct our work, we used a variety of methods, including reviews of Education and state documents, a 50-state survey, interviews with Education officials, and site visits in four states. We also reviewed relevant federal laws and regulations. To learn whether and how state expenditures for assessments have changed since NCLBA enactment and how states have spent these funds, we analyzed responses to our state survey, which was administered to assessment directors of the 50 states and the District of Columbia in January 2009. We received responses from 49 states, for a 96 percent response rate.[Footnote 3] We also conducted site visits to four states--Maryland, Rhode Island, South Dakota, and Texas--that reflect a range of population sizes and results from Education's assessment peer review. 
On these site visits we interviewed state officials, officials from two districts in each state, and technical advisors to each state. To gather information about factors states consider when making decisions about the item type and content of their assessments, we analyzed our survey and interviewed state officials and state technical advisors from our site visit states. We reviewed studies from our site visit states that evaluated the alignment between state standards and assessments, including the level of cognitive complexity in assessments, and spoke with representatives from four alignment organizations--organizations that evaluate the alignment between state standards and assessments--that states hire to conduct these studies. These alignment organizations included the three organizations that states most frequently hire to conduct alignment studies, and representatives of a fourth alignment organization that was used by one of our site visit states. In addition, we interviewed four assessment vendors that were selected because they work with a large number of states to obtain their perspectives on ESEA assessments and the assessment industry. We used our survey to collect information about challenges states have faced in ensuring validity and reliability. We also reviewed state documents from our site visit states, such as test security documentation for peer review and assessment security protocols, and interviewed state officials. We asked our site visit states to review a checklist created by the Council of Chief State School Officers (CCSSO), an association of state education agencies. A CCSSO official indicated that this checklist is still valid for state assessment programs. To address the extent of Education's support and oversight of ESEA assessment implementation, we reviewed Education guidance, summaries of Education assistance, and peer review protocols and training documents, and interviewed Education officials in charge of the peer review and assistance efforts. We conducted this performance audit from August 2008 through September 2009 in accordance with generally accepted government auditing standards. Those standards require that we plan and perform the audit to obtain sufficient, appropriate evidence to provide a reasonable basis for our findings and conclusions based on our audit objectives. We believe that the evidence obtained provides a reasonable basis for our findings and conclusions based on our audit objectives. Background: The ESEA was created to improve the academic achievement of disadvantaged children.[Footnote 4] The Improving America's Schools Act of 1994, which reauthorized ESEA, required states to develop state academic content standards, which specify what all students are expected to know and be able to do, and academic achievement standards, which are explicit definitions of what students must know and be able to do to demonstrate proficiency.[Footnote 5] In addition, the 1994 reauthorization required assessments aligned to those standards. The most recent reauthorization of the ESEA, the No Child Left Behind Act of 2001, built on the 1994 requirements by, among other things, increasing the number of grades and subject areas in which states were required to assess students.[Footnote 6] NCLBA also required states to establish goals for the percentage of students attaining proficiency on ESEA assessments that are used to hold schools and districts accountable for the academic performance of students. 
Schools and districts failing to meet state proficiency goals for 2 or more years must take actions, prescribed by NCLBA, in order to improve student achievement. Every state, district, and school receiving funds under Title I, Part A of ESEA--the federal formula grant program dedicated to improving the academic achievement of the disadvantaged--is required to implement the changes described in NCLBA. ESEA assessments may contain one or more of various item types, including multiple choice, open/constructed response, checklists, rating scales, and work samples or portfolios. GAO's prior work has found that item type is a major factor influencing the overall cost of state assessments and that multiple choice items are less expensive to score than open/constructed response items.[Footnote 7] Figure 1 describes several item types states use to assess student knowledge. Figure 1: Examples of Item Types: [Refer to PDF for image: illustration] Multiple choice item: An item that offers students two or more answers from which to select the best answer to the question or stem statement. Open/constructed response item: A student provides a brief response to a posed question that may be a few words or longer. Checklist: Often used for observing performance in order to keep track of a student's progress or work over time. This can also be used to determine whether students have met established criteria on a task. Rating scale: Used to provide feedback on a student's performance on an assessment based on pre-determined criteria. Work samples or portfolios: A teacher presents tasks for students to perform, and then rates and records students' responses to each task. These ratings and responses are recorded in student portfolios. Source: GAO; images, Art Explosion. [End of figure] NCLBA authorized additional funding to states for these assessments under the Grants for State Assessments program. Each year, each state has received a $3 million base amount regardless of its size, plus an additional amount based on its share of the nation's school-age population. States must first use the funds to pay the cost of developing the additional state standards and assessments. If a state has already developed the required standards and assessments, NCLBA allows these funds to be used to administer assessments or for other activities, such as developing challenging state academic standards in subject areas other than those required by NCLBA and ensuring that state assessments remain valid and reliable. In years when the grants have been awarded, the Grants for Enhanced Assessment Instruments program (Enhanced Assessment grants) has provided between $4 million and $17 million to several states. Applicants for Enhanced Assessment grants receive preference if they plan to fund assessments for students with disabilities or for Limited English Proficiency (LEP) students, or if they are part of a collaborative effort among states. States may also use other federal funds for assessment-related activities, such as funds for students with disabilities, and funds provided under the American Recovery and Reinvestment Act of 2009 (ARRA).[Footnote 8] ARRA provides about $100 billion for education through a number of different programs, including the State Fiscal Stabilization Fund (SFSF). In order to receive SFSF funds, states must provide certain assurances, including that the state is committed to improving the quality of state academic standards and assessments. 
In addition, Education recently announced plans to make $4.35 billion in incentive grants available to states through SFSF on a competitive basis. These grants--referred to by Education as the Race to the Top program--can be used by states for, among other things, improving the quality of assessments. Like other students, those with disabilities must be included in statewide ESEA assessments. This is accomplished in different ways, depending on the effects of a student's disability. Most students with disabilities participate in the regular statewide assessment either without accommodations or with appropriate accommodations, such as having unlimited time to complete the assessments, using large print or Braille editions of the assessments, or being provided individualized or small group administration of the assessments. States are permitted to use alternate academic achievement standards to evaluate the performance of students with the most significant cognitive disabilities. Alternate achievement standards must be linked to the state's grade-level academic content standards but may include prerequisite skills within the continuum of skills culminating in grade-level proficiency. For these students, a state must offer alternate assessments that measure students' performance. For example, the alternate assessment might assess students' knowledge of fractions by asking them to split groups of objects into two, three, or more equal parts. While alternate assessments can be administered to all eligible children, the number of proficient and advanced scores from alternate assessments based on alternate achievement standards included in Adequate Yearly Progress (AYP)[Footnote 9] decisions generally is limited to 1 percent of the total tested population at the state and district levels.[Footnote 10] In addition, states may develop modified academic achievement standards--achievement standards that define proficiency at a lower level than the achievement standards used for the general assessment population, but are still aligned with grade-level content standards--and use alternate assessments based on those standards for eligible students whose disabilities preclude them from achieving grade-level proficiency within the same period of time as other students. States may include scores from such assessments in making AYP decisions, but those scores generally are capped at 2 percent of the total tested population.[Footnote 11] States are also required to include LEP students in their ESEA assessments. To assess these students, states have the option of developing assessments in students' native languages. These assessments are designed to cover the content in state academic content standards at the same level of difficulty and complexity as the general assessments.[Footnote 12] In the absence of native language assessments, states are required to provide testing accommodations for LEP students, such as providing additional time to complete the test, allowing the use of a dictionary, administering assessments in small groups, or providing simplified instructions. By law, Education is responsible for determining whether states' assessments comply with statutory requirements. The standards and assessments peer review process used by Education to determine state compliance began under the 1994 reauthorization of ESEA and is an ongoing process that states go through whenever they develop new assessments. 
In the first step of the peer review process, a group of at least three experts--peer reviewers--examines evidence submitted by the state to demonstrate compliance with NCLBA requirements, identifies areas for which additional state evidence is needed, and summarizes its comments. The reviewers are state assessment directors, researchers, and others selected for their expertise in assessments. After the peer reviewers complete their review, an Education official assigned to the state reviews the peer reviewers' comments and the state's evidence and, using the same guidelines as the peer reviewers, makes a recommendation on whether the state meets, partially meets, or does not meet each assessment system critical element and on whether the state's assessment system should be approved. A group of Education officials from the relevant Education offices--including a representative from the Office of the Assistant Secretary of Elementary and Secondary Education--meets as a panel to discuss the findings. The panel makes a recommendation about whether to approve the state, and the Assistant Secretary makes the final approval decision. Afterwards, a letter is sent to the state notifying it whether it has been approved and, if the state was not approved, identifying why. States also receive a copy of the peer reviewers' written comments as a technical assistance tool to support improvement. Education has the authority to withhold federal funds provided for state administration until it determines that the state has fulfilled ESEA assessment requirements, and it has taken this step with several states since NCLBA was enacted. Education also provides states with technical assistance in meeting the academic assessment requirements. ESEA assessments must be valid and reliable for the purposes for which they are intended and aligned to challenging state academic standards. Education has interpreted these requirements in its peer review guidance to mean that states must show evidence of technical quality--including validity and reliability--and alignment with academic standards. According to Education's peer review guidance, the main consideration in determining validity is whether states have evidence that their assessment results can be interpreted in a manner consistent with their intended purposes. See appendix III for a complete description of the evidence used by Education to determine validity. A reliable assessment, according to the peer review guidance, minimizes the many sources of unwanted variation in assessment results. To show evidence of consistency of assessment results, states are required to (1) make a reasonable effort to determine the types of error that may distort interpretations of the findings, (2) estimate the likely magnitude of these distortions, and (3) make every possible effort to alert the users to this lack of certainty. As part of this requirement, states are required to demonstrate that assessment security guidelines are clearly specified and followed. See appendix IV for a full description of the reliability requirements. Alignment, according to Education's peer review guidance, means that states' assessment systems adequately measure the knowledge and skills specified in state academic content standards. 
If a state's assessments do not adequately measure the knowledge and skills specified in its content standards or if they measure something other than what these standards specify, it will be difficult to determine whether students have achieved the intended knowledge and skills. See appendix V for details about the characteristics states need to consider to ensure that its standards and assessments are aligned. In its guidance and peer review process, Education requires that--as one component of demonstrating alignment between state assessments and academic standards--states must demonstrate that their assessments are as cognitively challenging as their standards. To demonstrate this, states have contracted with organizations to assess the alignment of their ESEA assessments with the states' standards. These organizations have developed similar models of measuring the cognitive challenge of assessment items. For example, the Webb model categorizes items into four levels--depths of knowledge--ranging in complexity from level 1-- recall, which is the least difficult for students to answer, to level 4--extended thinking, which is the most difficult for students to answer. Table 1 provides an illustration, using the Webb model, of how depth of knowledge levels may be measured. Table 1: Illustration of Depth of Knowledge Levels: Depth of knowledge level: Level 1 - Recall; Description: Includes the recall of information such as a fact, definition, term, or a simple procedure, as well as performing a simple algorithm or applying a formula. Other key words that signify a Level 1 activity include "identify," "recall," "recognize," "use," and "measure." Depth of knowledge level: Level 2 - Skill/Concept; Description: Includes the engagement of some mental processing beyond a habitual response. A Level 2 assessment item requires students to make some decisions as to how to approach the problem or activity. Keywords that generally distinguish a Level 2 item include "classify," "organize," "estimate," "make observations," "collect and display data," and "compare data." These actions imply more than one step. Other Level 2 activities include noticing and describing non-trivial patterns; explaining the purpose and use of experimental procedures; carrying out experimental procedures; making observations and collecting data; classifying, organizing, and comparing data; and organizing and displaying data in tables, graphs, and charts. Depth of knowledge level: Level 3 - Strategic Thinking; Description: Requires reasoning, planning, using evidence, and a higher level of thinking than the previous two levels. In most instances, requiring students to explain their thinking is a Level 3. Activities that require students to make conjectures are also at this level. The cognitive demands at Level 3 are complex and abstract. The complexity does not result from the fact that there are multiple answers, a possibility for both Levels 1 and 2, but because the task requires more demanding reasoning. Other Level 3 activities include drawing conclusions from observations, citing evidence and developing a logical argument for concepts, explaining phenomena in terms of concepts, and using concepts to solve problems. Depth of knowledge level: Level 4 - Extended Thinking; Description: Requires complex reasoning, planning, developing, and thinking most likely over an extended period of time. At Level 4, the cognitive demands of the task should be high and the work should be very complex. 
Students should be required to make several connections--relate ideas within the content area or among content areas--and would have to select one approach among many alternatives for how the situation should be solved in order to be at this highest level. Level 4 activities include developing and proving conjectures; designing and conducting experiments; making connections between a finding and related concepts and phenomena; combining and synthesizing ideas into new concepts; and critiquing experimental designs. Source: Norman L. Webb, Issues Related to Judging the Alignment of Curriculum Standards and Assessments, April 2005. [End of table] States Reported That Assessment Spending Has Increased Since NCLBA Was Enacted and Test Development Has Been the Largest Assessment Cost in Most States: Assessment Expenditures Have Grown in Nearly Every State since 2002, and Most States Reported Spending More for Vendors than State Staff: State ESEA assessment expenditures have increased in nearly every state since the enactment of NCLBA in 2002, and the majority of these states reported that adding assessments was a major reason for the increased expenditures. Forty-eight of 49 states that responded to our survey said their states' overall annual expenditures for ESEA assessments have increased, and over half of these 48 states indicated that adding assessments to their state assessment systems was a major reason for increased expenditures.[Footnote 13] In some cases, even states that were already testing students in reading/language arts and mathematics in all of the grades required when NCLBA was enacted reported that assessment expenditures increased due to additional assessments. For example, officials in Texas--which was assessing general population students in all of the required grades at the time NCLBA was enacted--told us that they created additional assessments for students with disabilities. In addition to the cost of adding new assessments, states reported that increased vendor costs have also contributed to the increased cost of assessments. On our survey, increasing vendor costs was the second most frequent reason that states cited for increased ESEA assessment costs. One vendor official told us that shortly after the 2002 enactment of NCLBA, states benefited from increased competition because many new vendors entered the market and wanted to gain market share, which drove down prices. In addition, vendors were still learning about the level of effort and costs required to complete this type of work. Consequently, as the ESEA assessment market has stabilized and vendors have gained experience pricing assessments, the cost of ESEA assessment contracts has increased to reflect the true cost of vendor assessment work. One assessment vendor that works with over half of the states on ESEA assessments told us that vendor costs have also been increasing as states have been moving toward more sophisticated and costly procedures and reporting. Nearly all states reported higher expenditures for assessment vendors than for state assessment staff. According to our survey responses, 44 of the 46 states that responded said that of the total cost of ESEA assessments, much more was paid to vendors than to state employees. For example, one state reported it paid approximately $83 million to vendors and approximately $1 million to state employees in the 2007-08 school year. 
The 20 states that provided information for the costs of both vendors and state employees in 2007-08 reported spending more than $350 million for vendors to develop, administer, score, and report the results of ESEA assessments--more than 10 times the amount they spent on state employees. State expenditures for ESEA assessment vendors, which were far larger than expenditures for state staff, varied. Spending for vendors on ESEA assessments in the 40 states that reported spending figures on our survey ranged from $500,000 to $83 million, and in total all 40 states spent more than $640 million for vendors to develop, administer, score, and report results of the ESEA assessments in 2007-08. The average cost in these 40 states was about $16 million. See figure 2 for the distribution of state expenditures for vendors in 2007-08. Figure 2: State Expenditures for Assessment Vendors, 2007-08: [Refer to PDF for image: vertical bar graph] Dollar amounts states spent on assessment vendors: Below $15 million; Number of states: 26. Dollar amounts states spent on assessment vendors: $15 million-$60 million; Number of states: 12. Dollar amounts states spent on assessment vendors: Over $60 million; Number of states: 2. Source: GAO survey. [End of figure] Over half of the states reported that the majority of their funding for ESEA assessments--including funding for expenses other than vendors-- came from their state governments. Of the 44 states that responded to the survey question, 26 reported that the majority of their state's total funding for ESEA assessments came from state government funds for 2007-08, and 18 reported that less than half came from state funds. For example, officials from one state that we visited, Maryland, reported that 84 percent of their total funding for ESEA assessments came from state government funds and that 16 percent of the state's funding for ESEA assessments came from the federal Grants for State Assessments program in 2007-08. In addition to state funds, all states reported using Education's Grants for State Assessments for ESEA assessments, and 17 of 45 states responding to the survey question reported using other federal funds for assessments. One state reported that all of its funding for ESEA assessments came from the Grants for State Assessments program. The other federal funds used by states for assessments included Enhanced Assessment grants. The Majority of States Reported That Assessment Development Was the Most Expensive Component of the Assessment Process; Development Has Been More Challenging for Small States: More than half of the states reported that assessment development costs were more expensive than any other component of the student assessment process, such as administering or scoring assessments.[Footnote 14] Twenty-three of 43 states that responded to the question in our survey told us that test and item development and revision was the largest assessment cost for 2007-08. For example, Texas officials said that the cost of developing tests is higher than the costs associated with any other component of the assessment process. After test and item development costs, scoring was most frequently cited as the most costly activity, with 12 states reporting it as their largest assessment cost. Similarly, states reported that test and item development was the largest assessment cost for alternate assessments, followed by scoring. See figure 3 for more information. 
Figure 3: ESEA Assessment Activities That Received the Largest Share of States' Total ESEA Assessment Costs, 2007-08: [Refer to PDF for image: horizontal bar graph] General ESEA assessment activities: Number of states responding: Test development: 23; Number of states responding: Scoring of test: 12; Number of states responding: Test administration: 8. Alternate assessment with alternate achievement standards: Number of states responding: Test development: 23; Number of states responding: Scoring of test: 15; Number of states responding: Test administration: 4. Alternate assessment with modified achievement standards: Number of states responding: Test development: 7; Number of states responding: Scoring of test: 2; Number of states responding: Test administration: 1. Source: GAO survey data. [End of figure] The cost of developing assessments was affected by whether states release assessment items to the public.[Footnote 15] According to state and vendor officials, development costs are related to the percentage of items states release to the public every year because new items must be developed to replace released items. According to vendor officials, nearly all states release at least some test items to the public, but they vary in the percentage of items that they release. In states that release 100 percent of their test items each year, assessment costs are generally high and steady over time because states must develop additional items every year. However, some states release only a portion of items. For example, Rhode Island state officials told us that they release 20 to 50 percent of their reading and math assessment items every year. State and vendor officials told us that despite the costs associated with the release of ESEA assessment items, releasing assessment items builds credibility with parents and helps policymakers and the public understand how assessment items relate to state content standards. Development costs have been particularly challenging for smaller states.[Footnote 16] Assessment vendors and Education officials said that the price of developing an assessment is fixed regardless of state size and that, as a result, smaller states with fewer students usually have higher per pupil costs for development. For example, state assessment officials from South Dakota told us that their state and other states with small student populations have the same development costs as states with large assessment populations, regardless of the number of students being assessed. In contrast to development costs, administration and scoring costs vary based on the number of students being assessed and the item types used. Although large and small states face similar costs for development, each has control over some factors--such as item type and releasing test items--that can increase or decrease costs. Selected States Are Concerned about Costs of Developing and Administering Alternate Assessments for Students with Disabilities and Budget Cuts: State officials from the four states we visited told us that alternate assessments based on alternate achievement standards were far more expensive on a per pupil basis than general assessments. In Maryland, state officials told us that general assessments cost $30 per pupil, and alternate assessments cost between $300 and $400 per pupil. Rhode Island state officials also reported that alternate assessments cost much more than general assessments. 
These officials also said that, in addition to direct costs, the administration of alternate assessments has resulted in significant indirect costs, such as professional development for teachers. Technical advisors and district and state officials told us that developing alternate assessments is costly on a per pupil basis because the number of students taking these assessments is small. See appendix VI for more information about states' use of various item types for alternate assessments. In light of recent economic conditions, many states have experienced fiscal reductions, including within ESEA assessment budgets. As of January 2009, 19 states said their state's total ESEA assessment budget had been reduced as a result of state fiscal cutbacks. Fourteen states said their state's total ESEA assessment budgets had not been reduced, but 10 of these states also said they anticipated future reductions. Half of the 46 states that responded to the question told us that in developing their budget proposals for the next fiscal year they anticipated a reduction in state funds for ESEA assessments. For example, one state that responded to our survey said it had been asked to prepare for a 15 percent reduction in state funds. States Have Considered Cost and Time in Making Decisions about Assessment Item Type and Content: States Used Primarily Multiple Choice Items in Their ESEA Assessments Because They Are Cost-Effective and Can Be Scored within Tight Time Frames for Reporting Results: States have most often chosen multiple choice items over other item types on assessments. In 2003, we reported that the majority of states used a combination of multiple choice and a limited number of open- ended items for their assessments.[Footnote 17] According to our survey, multiple choice items comprise the majority of unweighted score points (points)--the number of points that can be earned based on the number of items answered correctly--for ESEA reading/language arts and mathematics general assessments administered by most responding states. Specifically, 38 of 48 states that responded said that multiple choice items comprise all or most of the points for their reading/language arts assessments, and 39 states said that multiple choice items comprise all or most of the points for mathematics assessments. Open/ constructed response items are the second most frequently used item type for reading/language arts or mathematics general assessments. All states that responded to our survey reported using multiple choice items on their general reading/language arts and mathematics assessments, and most used some open/constructed response items. See appendix VI for more information about the types of items used by states on assessments. Some states also reported on our survey that, since 2002, they have increased their use of multiple choice items and decreased their use of other item types. Of the 47 states that responded to our survey question, 10 reported increasing the use of multiple choice items on reading/language arts general assessments, and 11 reported increasing their use of multiple choice items on mathematics assessments. For example, prior to the enactment of NCLBA, Maryland administered an assessment that was fully comprised of open/constructed response items, but state assessment officials told us that they have moved to an assessment that is primarily multiple choice and plan to eliminate open/constructed response items from assessments. 
However, several states reported that they have decreased the use of multiple choice items and/or increased the use of open/constructed response items. For more information about how states reported changing the mix of items on their assessments, see figure 4. Figure 4: The Number of States Reporting Changes in Item Type Use on ESEA Assessments since 2002: [Refer to PDF for image: illustration] Reading/language arts: Upward changes, multiple choice: 10; Upward changes, open/constructed response: 5; Downward changes, multiple choice: 4; Downward changes, open/constructed response: 11. Mathematics: Upward changes, multiple choice: 11; Upward changes, open/constructed response: 5; Downward changes, multiple choice: 4; Downward changes, open/constructed response: 13. Source: GAO. [End of figure] States reported that the total cost of using an item type and the ability to score assessments quickly were key considerations in choosing multiple choice items. In response to our survey, most states reported considering the cost of different item types and the ability to score the tests quickly when making decisions about item types for ESEA assessments. Officials from the states we visited reported choosing multiple choice items because they can be scored inexpensively within challenging time frames. State officials, assessment experts, and vendors told us that multiple choice item types are scored electronically, which is inexpensive, but that open/constructed response items are usually scored manually, making them more expensive to score. Multiple scorers of open/constructed response items are sometimes involved to ensure consistency, but this also increases costs. In addition, state officials said that training scorers of open/constructed response items is costly. For example, assessment officials in Texas told us that the state has a costly 3-week-long training process for teachers to become qualified to assess the open-ended responses. State assessment officials also told us that they used multiple choice items because they can be scored quickly, and assessment vendors reported that states were under pressure to release assessment results to the public before the beginning of the next school year in accordance with NCLBA requirements. For example, assessment officials from South Dakota told us that they explored using open/constructed response items on their assessments but that they ultimately determined it would not be feasible to return results in the required period of time. States also reported considering whether item types would meet certain technical requirements, such as validity and reliability. Texas assessment officials said that using multiple choice items allows the state more time to check test scores for reliability. States Reported That the Use of Multiple Choice Items in Assessments Has Limited the Content and Complexity of What They Test: Despite the cost- and time-saving benefits to states, the use of multiple choice items on assessments has limited the content included in the assessments. Many state assessment officials, alignment experts, and vendor officials told us that items possess different characteristics that affect how amenable they are to testing various types of content. State officials and their technical advisors told us that they have faced significant trade-offs between their efforts to assess highly cognitively complex content and their efforts to accommodate cost and time pressures. 
All four of the states that we visited reported separating at least a small portion of their standards into those used for ESEA assessment and those intended for instructional purposes only. Three of the four states reported that standards for instructional purposes only included highly cognitively complex material that could not be assessed using multiple choice items. For example, a South Dakota assessment official told us that a cognitively complex portion of the state's new reading standards could not be tested by multiple choice; therefore, the state identified these standards as for instructional purposes only and did not include them in ESEA assessments. In addition to these three states, officials from the fourth state--Maryland--told us that they do not include certain content in their standards because it is difficult to assess. Many state officials and experts we spoke with told us that multiple choice items limit states' ability to assess highly cognitively complex content. For example, Texas assessment officials told us that some aspects of state standards, such as a student's ability to conduct scientific research, cannot be assessed using multiple choice. Representatives of the alignment organizations told us that it is difficult, and in some cases not possible, to measure highly cognitively complex content with multiple choice items. Three of the four main groups that conduct alignment studies, including alignment studies for all of our site visit states, told us that states cannot measure content of the highest complexity with multiple choice and that ESEA assessments should include greater cognitive complexity. Maryland state officials said that before NCLBA was enacted the state administered an assessment that was fully comprised of open/constructed response items. Maryland technical advisors told us that because the state faced pressure to return assessment results quickly, the state changed its test to include mostly multiple choice items, but that this had limited the content assessed in the test. According to an independent alignment review of one Maryland high school mathematics assessment performed in 2002, after the enactment of NCLBA, about 94 percent of the assessment's 36 scorable items were rated at the two lowest of four levels of cognitive demand.[Footnote 18], [Footnote 19] Representatives of all four alignment groups told us that multiple choice items can measure intermediate levels of cognitive complexity, but it is difficult and costly to develop these items. These alignment experts said that developing multiple choice items that measure cognitively challenging content is more expensive and time-consuming than developing less challenging multiple choice items. Vendor officials had differing views about whether multiple choice items assess cognitively complex content. For example, officials from three vendors said that multiple choice items can address cognitively complex content. However, officials from another vendor told us that it is not possible to measure certain highly cognitively complex content with multiple choice items. Moreover, two other vendors told us that there are certain content and testing purposes that are more amenable to assessment with item types other than multiple choice. Several of the vendors reported that there are some standards that, because of practical limitations faced by states, cannot be assessed on standardized, paper-and-pencil assessments. 
For example, one vendor official told us that performance-based tasks enabled states to assess a wider variety of content but that the limited funds and quick turnaround times required under the law require states to eliminate these item types. Although most state officials, state technical advisors, and alignment experts said that ESEA assessments should include more open/constructed response items and other item types, they also said that multiple choice items have strengths and that there are challenges with other types of items. For example, in 2008 a national panel of assessment experts appointed and overseen by Education reported that multiple choice items do not measure different aspects of mathematics competency than open/constructed response items. Also, alignment experts said that multiple choice items can quickly and effectively assess lower level content, which is also important to assess. Moreover, open/constructed response items do not always assess highly complex content, according to an alignment expert. This point has been corroborated by several researchers who have found that performance tasks, which are usually intended to assess higher-level cognitive content may inadvertently measure low-level content.[Footnote 20] For example, one study describes a project in which students were given a collection of insects and asked to organize them for display. High-scoring students were supposed to demonstrate complex thinking skills by sorting insects based on scientific classification systems, rather than less complex criteria, such as whether or not insects are able to fly. However, analysis of student responses showed that high scorers could not be distinguished from low scorers in terms of their knowledge of the insects' features or of the scientific classification system.[Footnote 21] The presence or absence of highly complex content in assessments can impact classroom curriculum. Several research studies have found that content contained in assessments influences what teachers teach in the classroom. One study found that including open-ended items on an assessment prompted teachers to ask students to explain their thinking and emphasize problem solving more often.[Footnote 22] Assessment experts told us that the particular content that is tested impacts classroom curriculum. For example, one assessment expert told us that the focus on student results, combined with the focus on multiple choice items, has led to teachers teaching a narrow curriculum that is focused on basic skills. Under the federal peer review process, Education and peer reviewers examined evidence that ESEA assessments are aligned with the state's academic standards. Specifically, peer reviewers examined state evidence that assessments cover the full depth and breadth of the state academic standards in terms of cognitive complexity and level of difficulty. However, consistent with federal law, it is Education's policy not to directly examine a state's academic standards, assessments, or specific test items.[Footnote 23] Education officials told us that it is not the department's role to evaluate standards and assessments themselves and that few at Education have the expertise that would be required to do so. Instead, they explained that Education's role is to evaluate the evidence provided by states to determine whether the necessary requirements are met. 
States Used Alternative Practices to Reduce Cost and Meet Quick Turnaround Times while Attempting to Assess Complex Material: As an alternative to using mostly multiple choice items on ESEA assessments, states used a variety of practices to reduce costs and meet quick turnaround times while also attempting to assess cognitively complex material. For example, some states have developed and administered ESEA assessments in collaboration with other states, which has allowed these states to pool resources and use a greater diversity of item types. In addition, some states administered assessments at the beginning of the year that test students on material taught during the prior year to allow additional time for scoring of open-response items, or administered assessments online to decrease turnaround time for reporting results. States have reported advantages and disadvantages associated with each of these practices: * Collaboration among states: As of March 2009, all four states that we visited--Maryland, Texas, South Dakota, and Rhode Island--had indicated interest in collaborating with other states in the development of ESEA reading/language arts or mathematics assessments, but only Rhode Island was doing so. Under the New England Common Assessment Program (NECAP), Rhode Island, Vermont, New Hampshire, and Maine share a vendor, a common set of standards, and item development costs. Under this agreement, the costs of administration and scoring are based on per pupil rates. NECAP states use a combination of multiple choice, short answer, and open/constructed response items. According to Rhode Island assessment officials, more rigorous items, including half of their math items, are typically embedded within open/constructed response items. When asked about the benefits of working in collaboration with other states to develop ESEA assessments, assessment officials for Rhode Island told us that the fiscal savings are very apparent. Specifically, they stated that Rhode Island will save approximately $250,000 per year with the addition of Maine to the NECAP consortium because, as Rhode Island assessment officials noted, Maine will take on an additional share of item development costs. Also, officials said that with a multi-state partnership, Rhode Island is able to pay more for highly skilled people who share a common vision. Finally, they said that higher standards are easier to defend politically as part of a collaboration because there are more stakeholders in favor of them. An assessment expert from New Hampshire said that the consortium has been a "lifesaver" because it has saved the state considerable funding and allowed it to meet ESEA assessment requirements. Assessment experts from Rhode Island and New Hampshire told us that there are some challenges to working in collaboration with other states to develop ESEA assessments. Because decisions are made by consensus and the NECAP states have philosophical differences in areas such as item development, scoring, and use of item types, decision-making is a lengthy process. In addition, a Rhode Island official said that assessment leadership in the states changes frequently, which also makes decision-making difficult. * Beginning of year test administration: NECAP states currently administer assessments in the beginning of the year, which eases time pressures associated with the scoring of open/constructed response items. 
As a result, the inclusion of open/constructed response items on the assessment has been easier because there is enough time to meet NCLBA deadlines for reporting results. However, Rhode Island officials said that there are challenges to administering tests at the beginning of the year. For example, one official stated that coordinating testing with the already challenging start of school is daunting. Specifically, she said that state assessment officials are required to use school enrollment lists to print school labels for individual tests, but because enrollment lists often change at the beginning of the year, officials are required to correct a substantial amount of data. District assessment officials also cited this as a major problem. * Computerized testing: Of the states we visited, Texas was the only one administering a portion of its ESEA assessments online, but Maryland and Rhode Island were moving toward this goal. One assessment vendor with whom we spoke said that many states are anticipating this change in the not-too-distant future. Assessment vendors and state assessment officials cited some major benefits of online assessment. For example, one vendor told us that online test administration reduces costs by using technology for automated scoring. The vendor also told us that states are using online assessments to address cognitively complex content in standards that are difficult to assess, such as scientific knowledge that is best demonstrated through experiments. In addition, assessment officials told us that online assessments are less cumbersome and easier than paper tests to manage at the school level if schools have the required technology and that they enable quicker turnaround on scores. State and district assessment officials and a vendor with whom we spoke also cited several challenges associated with administering tests online, including security of the tests; variability in students' computer literacy; strain on school computer resources and computer classrooms/labs; interruption of classroom/lab instruction; and lack of necessary computer infrastructure. States Faced Several Challenges in Their Efforts to Ensure Valid and Reliable ESEA Assessments, including Staff Capacity, Alternate Assessments, and Assessment Security: States Varied in Their Capacity to Guide and Oversee Vendors: State officials are responsible for guiding the development of the state assessment program and overseeing vendors, but states varied in their capacity to fulfill these roles. State officials reported that they are responsible for making key decisions about the direction of their states' assessment programs, such as whether to develop alternate assessments based on modified achievement standards or online assessments. In addition, state officials said that they are responsible for overseeing the assessment vendors used by their states. However, state assessment offices varied based on the measurement expertise of their staff. About three-quarters of the 48 responding states had at least one state assessment staff member with a Ph.D. in psychometrics or another measurement-related field. Three states--North Carolina, South Carolina, and Texas--each reported having five staff with this expertise. However, 13 states did not have any staff with this expertise. In addition, states varied in the number of full-time equivalent professional staff (FTE) dedicated to ESEA assessments, ranging from 55 professional staff in Texas to 1 in Idaho and the District of Columbia. 
See figure 5 for more information about the number of FTEs dedicated to ESEA assessments in the states. Figure 5: Number of FTEs Dedicated to ESEA Assessments in States, 2007-08: [Refer to PDF for image: vertical bar graph] Number of FTEs: 1 to 5; Number of states: 11. Number of FTEs: 6 to 15; Number of states: 21. Number of FTEs: 16 to 25; Number of states: 6. Number of FTEs: 26 and up; Number of states: 6. Source: GAO survey. [End of figure] Small states had less assessment staff capacity than larger states. The capacity of state assessment offices was related to the amount of funding spent on state assessment programs in different states, according to state officials. For example, South Dakota officials told us that they had tried to hire someone with psychometric expertise but that they would need to quadruple the salary that they could offer to compete with the salaries being offered by other organizations. State officials said that assessment vendors can often pay higher salaries than states and that it is difficult to hire and retain staff with measurement-related expertise. State officials and assessment experts told us that the capacity of state assessment offices was the key challenge for states implementing NCLBA. Greater state capacity allows states to be more thoughtful in developing their state assessment systems and to provide greater oversight of their assessment vendors, according to state officials. Officials in Texas and other states said that having high assessment staff capacity--both in terms of number of staff and measurement-related expertise--allows them to research and implement practices that improve student assessment. For example, Texas state officials said that they conduct research regarding how LEP students and students with disabilities can best be included in ESEA assessments, which they said has helped them improve the state's assessments for these students. In contrast, officials in lower-capacity states said that they struggled to meet ESEA assessment requirements and did not have the capacity to conduct research or implement additional strategies. For example, officials in South Dakota told us that they had not developed alternate assessments based on modified achievement standards because they did not have the staff capacity or funding to implement these assessments. Also, of the three states we visited that completed a checklist of important assessment quality control steps,[Footnote 24] those with fewer assessment staff addressed fewer key quality control steps. Specifically, Rhode Island, South Dakota, and Texas reviewed and completed a CCSSO[Footnote 25] checklist on student assessment, the Quality Control Checklist for Processing, Scoring, and Reporting. These states varied with regard to fulfilling the steps outlined by this checklist. For example, state officials in Texas, which has 55 full-time professional staff working on ESEA assessments, including multiple staff with measurement-related expertise, reported that they fulfill 31 of the 33 steps described in the checklist and address the 2 other steps in certain circumstances. Officials in Rhode Island, who told us that they have six assessment staff and work in conjunction with other states in the state's assessment consortium, said that they fulfill 27 of the 33 steps. South Dakota, which had three professional full-time staff working on ESEA assessments--and no staff with measurement-related expertise--addressed nine of the steps, according to state officials. 
For example, South Dakota officials said that the state does not verify the accuracy of answer keys in the data file provided by the vendor using actual student responses, which increases the risk of incorrectly scoring assessments. Because South Dakota does not have staff with measurement-related expertise and has fewer state assessment staff, there are fewer individuals to fulfill these quality control steps than in a state with greater capacity, according to state officials. Having staff with psychometric or other measurement-related expertise improved states' ability to oversee the work of vendors. For example, the CCSSO checklist recommends that states have psychometric or other research expertise for nearly all of the 33 steps. Having staff with measurement-related expertise allows states to know what key technical questions to ask of vendors and what data to request from them, according to state officials, and without this expertise states would be more dependent on vendors. State advisors from technical advisory committees (TACs)--panels of assessment experts that states convene to assist them with technical oversight--said that TACs are useful, but that they generally meet only every 6 months. For example, one South Dakota TAC member said that TACs can provide guidance and expertise, but that ensuring the validity and reliability of a state assessment system is a full-time job. The TAC member said that questions arise on a regular basis for which it would be helpful to bring measurement-related expertise to bear. Officials from assessment vendors varied in what they told us. Several told us that states do not need measurement-related expertise, but others said that states need this expertise on staff. Education's Inspector General (OIG) found reliability issues with management controls over state ESEA assessments.[Footnote 26] Specifically, the OIG found that Tennessee did not have sufficient monitoring of contractor activities for the state assessments, such as ensuring that individuals scoring open/constructed response items had proper qualifications. In addition, the OIG found that the state lacked written policies and procedures describing internal controls for scoring and reporting. States Have Faced Challenges in Ensuring the Validity and Reliability of Alternate Assessments for Students with Disabilities: Although most states have met peer review expectations for validity and reliability of their general assessments, ensuring the validity of alternate assessments for students with disabilities is still a challenge. For example, our review of Education documents as of July 15, 2009, showed that 12 states' reading/language arts and mathematics standards and assessment systems--which include general assessments and alternate assessments based on alternate achievement standards--had not received full approval under Education's peer review process and that alternate assessments were a factor preventing approval in 11 of these states.[Footnote 27] In the four states[Footnote 28] where alternate assessments were the only issue preventing full approval, technical quality (which includes validity and reliability) or alignment was a problem. For example, in a letter to Hawaii education officials dated October 30, 2007, documenting steps the state must take to gain full approval of its standards and assessments system, Education officials wrote that Hawaii officials needed to document the validity and alignment of the state alternate assessment. 
States had more difficulty assessing the validity and reliability of alternate assessments using alternate achievement standards than of ESEA assessments for the general student population. In our survey, nearly two-thirds of the states reported that assessing the validity and reliability of alternate assessments with alternate achievement standards was either moderately or very difficult. In contrast, few states reported that assessing either validity or reliability was moderately or very difficult for general assessments. We identified two specific challenges to the development of valid and reliable alternate assessments with alternate achievement standards. First, ensuring the validity and reliability of these alternate assessments has been challenging because of the highly diverse population of students being assessed. Alternate assessments are administered to students with a wide range of significant cognitive disabilities. For example, some students may only be able to communicate by moving their eyes and blinking. As a result, measuring the achievement of these students often requires greater individualization. In addition, because these assessments are administered to relatively small student populations, it can be difficult for states to gather the evidence needed to demonstrate their validity and reliability. Second, developing valid and reliable alternate assessments with alternate achievement standards has been challenging for states because there is a lack of research about the development of these assessments, according to state officials and assessment experts. States have been challenged to design alternate assessments that appropriately measure what eligible students know and provide similar scores for similar levels of performance. Experts and state officials told us that more research would help them ensure validity and reliability. An Education official agreed that alternate assessments are still a challenge for states and said that there is little consensus about what types of alternate assessments are psychometrically appropriate. Although there is currently a lack of research, Education is providing assistance to states with alternate assessments and has funded a number of grants to help states implement alternate assessments. States that have chosen to implement alternate assessments with modified achievement standards and native language assessments have faced similar challenges, but relatively few states are implementing these assessments. In our survey, 8 of the 47 states responding to this question reported that in 2007-08 they administered alternate assessments based on modified achievement standards, which are optional for states, and several more reported being in the process of developing these assessments. Fifteen states reported administering native language assessments, which are also optional. States reported mixed results regarding the difficulty of assessing the validity and reliability of these assessments, with about two-thirds indicating that each of these tasks was moderately or very difficult for both the alternate assessments with modified achievement standards and the native language assessments. Officials in states that are not offering these assessments reported that they lacked the funds necessary to develop these assessments or that they lacked the staff or time. 
States Have Taken Measures to Ensure Assessment Security, but Gaps Exist: The four states that we visited and districts in those states had taken steps to ensure the security of ESEA assessments. Each of the four states had a test administration manual intended to establish controls over the processes and procedures used by school districts when they administer the assessments. For example, the Texas test administration manual covered procedures for keeping assessment materials secure prior to administration, ensuring proper administration, returning student answer forms for scoring, and notifying administrators in the event of assessment irregularities. States also required teachers administering the assessments to sign forms saying that they would ensure security, and states had penalties for teachers or administrators who violated the rules. For example, South Dakota officials told us that teachers who breach the state's security measures could lose their teaching licenses. Despite these efforts, there have been a number of documented instances of teachers and administrators cheating in recent years. For example, researchers in one major city examined the frequency of cheating by test administrators.[Footnote 29] They estimated that at least 4 to 5 percent of the teachers and administrators cheated on student assessments by changing student responses on answer sheets, providing correct answers to students, or illegitimately obtaining copies of exams prior to the test date and teaching students using knowledge of the precise exam items. Further, the study found that teachers' and administrators' decisions about whether to cheat responded to incentives. For example, when schools faced the possibility of being sanctioned for low assessment scores, teachers were more likely to cheat. In addition, the study found that teachers in low-performing classrooms were more likely to cheat. In our work, we identified several gaps in state assessment security policies. For example, assessment security experts said that many states do not conduct any statistical analyses of assessment results to detect indications of cheating. Among our site visit states, one state--Rhode Island--reported analyzing test results for unexpected gains in schools' performance. Another state, Texas, had conducted an erasure analysis to determine whether schools or classrooms had an unusually high number of erased responses that were changed to correct responses, possibly indicating cheating. Security experts described these types of analyses as a key component of assessment security. In addition, we identified one specific state assessment policy under which teachers had an opportunity to change test answers. South Dakota's assessment administration manual required classroom teachers to inspect all student answers to multiple choice items and darken any marks that were too light for scanners to read. Further, teachers were instructed to erase any stray marks and ensure that, when a student had changed an answer, the unwanted response was completely erased. This policy provided teachers with an opportunity to change answers and improve assessment results. South Dakota officials told us that they had considered taking steps to mitigate the potential for cheating, such as contracting for an analysis that would identify patterns of similar erasure marks that could indicate cheating, but that it was too expensive for the state. 
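To illustrate the type of erasure analysis described above, the following sketch shows one simple way such a screen could be implemented. This is our own illustration, not a reproduction of any state's or vendor's actual procedure; the classroom identifiers and counts are hypothetical, and real analyses typically rely on scanner-captured erasure data and more sophisticated statistical methods. The sketch flags classrooms whose count of wrong-to-right erasures greatly exceeds the count that the statewide erasure rate would predict.

# Illustrative erasure-analysis screen (hypothetical data, written in Python).
# Flags classrooms whose wrong-to-right (WTR) erasure count far exceeds the
# count predicted by the statewide WTR erasure rate.

import math

# classroom_id -> (answer-document items scanned, WTR erasures detected)
classrooms = {
    "A-101": (3000, 12),
    "A-102": (2800, 9),
    "B-201": (3100, 55),  # hypothetical outlier
    "B-202": (2900, 11),
    "C-301": (3050, 14),
}

total_items = sum(items for items, _ in classrooms.values())
total_wtr = sum(wtr for _, wtr in classrooms.values())
statewide_rate = total_wtr / total_items  # baseline WTR erasure rate

# Poisson-style screen: flag a classroom if its observed WTR count exceeds the
# expected count by more than 4 times the square root of the expected count.
for room, (items, wtr) in sorted(classrooms.items()):
    expected = statewide_rate * items
    cutoff = expected + 4 * math.sqrt(expected)
    status = "flag for follow-up review" if wtr > cutoff else "within expected range"
    print(f"{room}: observed {wtr}, expected {expected:.1f}, cutoff {cutoff:.1f} -> {status}")

A screen of this kind identifies only statistical anomalies; flagged classrooms would still require investigation before any conclusion about cheating could be drawn.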
States' assessment security policies and procedures were examined during Education's standards and assessments peer review process. According to Education's peer review guidance, which Education officials told us contained the criteria used by peer reviewers to examine state assessment systems, states must demonstrate the establishment of clear criteria for the administration, scoring, analysis, and reporting components of state assessment systems. One example of evidence of adequate security procedures listed in the peer review guidance was that the state uses training and monitoring to ensure that people responsible for handling or administering state assessments properly protect the security of the assessments. Education indicated that a state could submit as evidence documentation that the state's test security policy and consequences for violating the policy are communicated to educators, and documentation of the state's plan for training and monitoring assessment administration. According to Education officials, similar indicators are included in Education's ongoing efforts to monitor state administration and implementation of ESEA assessment requirements. Although test security was included as a component in the peer review process, we identified several gaps in how the process evaluated assessment security. The peer reviewers did not examine whether states used any type of data analysis to review student assessment results for irregularities. When we spoke with Education's director of student achievement and school accountability programs--who manages the standards and assessments peer review process--about how assessment security was examined in the peer review process, he told us that security was not a focus of peer review. The official indicated that the review already required a great deal of time and effort by reviewers and state officials and that Education had given a higher priority to other assessment issues. In addition, the state policy described above in which teachers darken marks or erase unwanted responses was approved through the peer review process. The Education official who manages the standards and assessments peer review process told us that the peer review requirements, including the assessment security portion, were based on the Standards for Educational and Psychological Testing,[Footnote 30] which were developed in 1999. The Standards provide general guidelines for assessment security, such as that test users have the responsibility of protecting the security of test materials at all times. However, they do not provide comprehensive best practices for assessment security issues. The Association of Test Publishers developed draft assessment security guidelines in 2007. In addition, in spring 2010, the Association of Test Publishers and CCSSO plan to release a guide for state departments of education that is expected to offer best practices for test security. Education has made certain modifications to the peer review process but does not plan to update the assessment security requirements. For example, Education updated the peer review protocols to address issues with the alternate assessment using modified achievement standards after those regulations were released, and it has made other modifications to the process that were requested by states. However, Education officials indicated that they do not have plans to update the peer review assessment security requirements. 
Education Has Provided Assistance to States, but the Peer Review Process Did Not Allow for Sufficient Communication: Education Provided Technical Assistance with Assessments, including Those for Students with Disabilities and LEP Students: Education provided technical assistance to states in a variety of ways, including meetings, written guidance, user guides, contact with Education staff, and assistance from its Comprehensive Centers and Clearinghouses. In our survey, states reported they most often used written guidance and Education-sponsored meetings and found these helpful. States reported mixed results in obtaining assistance from Education staff. Some reported receiving consistent, helpful support, while others reported that staff were not helpful or responsive. Relevant program offices within Education provided additional assistance as needed. For example, the Office of Special Education Programs provided assistance to states in developing alternate assessments for students with disabilities, and the Office of English Language Acquisition, Language Enhancement, and Academic Achievement for Limited English Proficient Students assisted states in developing their assessments for LEP students. In addition, beginning in 2002, Education awarded competitive Enhanced Assessment Grants to state collaboratives working on a variety of assessment topics, such as developing valid and reliable assessments for students with disabilities and LEP students. For example, one consortium of 14 states and jurisdictions was awarded about $836,000 to investigate and provide information on the validity of accommodations for future assessments for LEP students with disabilities, a group of students with dual challenges. States awarded grants are required to share the outcomes of their projects with other states at national conferences; however, since these are multi-year projects, the results of many of them are not yet available. Education's Peer Review Process Did Not Allow Direct Communication between States and Reviewers to Quickly Resolve Problems: Education's peer review process did not allow for direct communication between states and peer reviewers that could have more quickly resolved questions or problems that arose throughout the peer review process. After states submitted evidence of compliance with ESEA assessment requirements to Education, groups of three reviewers examined the materials and made recommendations to Education. To ensure the anonymity of the peer reviewers, Education did not permit communication between reviewers and state officials. Instead, Education liaisons periodically relayed peer reviewers' questions and comments to the states and then relayed answers back to the peer reviewers. Education officials told us the assurance of anonymity was an important factor in their ability to recruit peer reviewers who may not have felt comfortable making substantive comments on states' assessment systems if their identity was known. However, the lack of direct communication resulted in miscommunication and prevented the quick resolution of questions that arose during the peer review process. State officials and reviewers told us that there was not enough communication between states and reviewers during the process. 
For example, one state official reported on our survey that the lack of direct communication with peer reviewers led to misunderstandings that could have been readily resolved through a conversation with the reviewers. A number of the peer reviewers whom we surveyed provided similar information. For example, one said that the process was missing direct communication, which would allow state officials to provide immediate responses to the reviewers' questions. The Education official who manages the standards and assessments peer review process recognized that the lack of communication created confusion, such as a state not understanding how to interpret peer reviewers' comments. Two experts we interviewed about peer review processes in general said that communication between reviewers and state officials is critical to having an efficient process that avoids miscommunication and unnecessary work. State officials said that the peer review process was extensive and that miscommunication made it more challenging. In response to states' concerns, Education has taken steps to improve the peer review process by offering states the option of having greater communication with reviewers after the peer review process is complete. However, the department has not taken action to allow direct communication between states and peer reviewers during the process to ensure a quick resolution to questions or issues that arise, preferring to continue its reliance on Education staff to relay information between states and peer reviewers and to protect the anonymity of the peer reviewers. Reasons for Key Decisions Stemming from Education's Peer Review Process Were Not Communicated to States: In some cases, the final approval decisions made by Education, which has final decision-making authority, differed from the peer reviewers' written comments, but Education could not tell us how often this occurred. Education's panels assessed each state's assessment system using the same guidelines used by the peer reviewers, and agency officials told us that peer reviewers' comments carried considerable weight in the agency's final decisions. However, Education officials said that--in addition to peer reviewers' comments--they also considered other factors in determining whether a state should receive full approval, including the time needed by the state to come into compliance and the scope of the outstanding issues. Education and state officials told us that, in some cases, Education reached different decisions than the peer reviewers did. For example, the Education official who manages the standards and assessments peer review process described a situation in which a state was changing its content standards and frequently submitting new documentation for its mathematics assessment as the new content standards were incorporated. Education officials told us the peer reviewers were confused by the documentation, but Education gave the state credit for the most recent documentation. However, Education could not tell us how often the agency's final decisions matched the written comments of the peer reviewers because it did not track this information. In cases in which Education's final decisions differed from the peer reviewers' comments, Education did not explain to states why it reached its decisions. 
Although Education released the official decision letters describing reasons that states had not been approved through peer review, the letters did not document whether Education's decisions differed from the peer reviewers' comments or why the decisions were different. Because Education did not communicate this to states, it was unclear to states how written peer reviewer comments related to Education's decisions about peer review approval. For example, in our survey, one state reported that the comments provided to the state by peer reviewers and the letters sent to the state by Education describing the agency's final decisions about approval status did not match. State officials we interviewed reported confusion about what issues needed to be addressed to receive full approval of their assessment system. For example, some state officials reported confusion about how to receive final peer review approval when the written summary of the peer review comments differed from the steps necessary to receive full approval that were outlined in the official decision letters from Education. The Education official who manages the standards and assessments peer review process said that in some cases the differences between decision letters and peer reviewers' written comments led to state officials being unclear about whether they were required to address the issues in Education's decision letters, the comments from peer reviewers, or both. Conclusions: NCLBA set lofty goals for states to work toward having all students reach academic proficiency by the 2013-14 school year, and Congress has provided significant funding to assist states. NCLBA required a major expansion in the use of student assessments, and states must measure higher-order thinking skills and understanding with these assessments. Education currently reviews states' adherence to NCLBA standards and assessment requirements through its peer review process, in which the agency examines evidence submitted by each state that is intended to show that state standards and assessment systems meet NCLBA requirements. However, ESEA, as amended, prohibits federal approval or certification of state standards. Education reviews the procedures that states use to develop their standards, but does not review the state standards on which ESEA assessments are based or evaluate whether state assessments cover highly cognitively complex content. As a result, there is no assurance that states include highly cognitively complex content in their assessments. Although Education does not assess whether state assessments cover highly complex content, Education's peer review process does examine state assessment security procedures, which are critical to ensuring that assessments are valid and reliable. In addition, the security of ESEA assessments is critical because these assessments are the key tool used to hold schools accountable for student performance. However, Education has not made assessment security a focus of its peer review process and has not incorporated best practices in assessment security into its peer review protocols. Unless Education takes advantage of forthcoming best practices that include assessment security issues, incorporates them into the peer review process, and places proper emphasis on this important issue, some states may continue to rely on inadequate security procedures that could affect the reliability and validity of their assessment systems. 
State ESEA assessment systems are complex and require a great deal of time and effort from state officials to develop and maintain. Because of the size of these systems, the peer review is an extensive process that also demands considerable time and effort on the part of state officials. However, because Education, in an attempt to maintain peer reviewer confidentiality, does not permit direct communication between state officials and peer reviewers, miscommunication may have resulted in some states spending more time than necessary clarifying issues and providing additional documentation. While Education officials told us the assurance of anonymity was an important factor in their ability to recruit peer reviewers, anonymity should not automatically preclude communications between state officials and peer reviewers during the peer review process. For example, technological solutions could be used to retain anonymity while still allowing for direct communication. Direct communication between reviewers and state officials during the peer review process could reduce the amount of time and effort required of both peer reviewers and state officials. The standards and assessments peer review is a high-stakes decision-making process for states. States that do not meet ESEA requirements for their standards and assessments systems can ultimately lose federal Title I, Part A funds. Transparency is a critical element for ensuring that decisions are fully understood and peer review issues are addressed by states. However, because critical Education decisions about state standards and assessments systems sometimes differed from peer reviewers' written comments and the reasons behind these differences were not communicated to states, states were confused about the issues they needed to address. Recommendations for Executive Action: To help ensure the validity and reliability of ESEA assessments, we recommend that the Secretary of Education update Education's peer review protocols to incorporate best practices in assessment security when they become available in spring 2010. To improve the efficiency of Education's peer review process, the Secretary of Education should develop methods for peer reviewers and states to communicate directly during the peer review process so that questions that arise can be addressed quickly. For example, peer reviewers could be assigned a generic e-mail address that would allow them to remain anonymous but still allow them to communicate directly with states. To improve the transparency of Education's approval decisions pertaining to states' standards and assessment systems and to help states understand what they need to do to improve their systems, the Secretary of Education should explain, in cases where Education's peer review decisions differed from those of the reviewers, why they differed. Agency Comments and Our Evaluation: We provided a draft of this report to the Secretary of Education for review and comment. Education's comments are reproduced in appendix VII. In its comments, Education recognizes the value of test security practices in maintaining the validity and reliability of states' assessment systems. However, regarding our recommendation to incorporate test security best practices into the peer review protocols, Education indicated that it believes that its current practices are sufficient to ensure that appropriate test security policies and procedures are implemented. 
Education officials indicated that states currently provide the agency with evidence of state statutes, rules of professional conduct, administrative manuals, and memoranda that address test security and reporting of test irregularities. Education officials also stated that additional procedures and requirements, such as security methods and techniques to uncover testing irregularities, are typically included in contractual agreements with test publishers or collective bargaining agreements and that details on these additional provisions are best handled locally based on considerations of risk and cost. Furthermore, Education stated that it plans to continue to monitor test security practices and to require corrective action by states it finds to have weak or incomplete test security practices. As stated in our conclusions, we continue to believe that Education should incorporate forthcoming best practices, including assessment security issues, into the peer review process. Otherwise, some states may continue to rely on inadequate security procedures, which could ultimately affect the reliability and validity of their assessment systems. Education agreed with our recommendations to develop methods to improve communication during the review process and to identify for states why its peer review decisions in some cases differed from peer reviewers' written comments. Education officials noted that the agency is considering the use of a secure server as a means for state officials to submit questions, documents, and other evidence to strengthen communication during the review process. Education also indicated that it will conduct a conference call prior to upcoming peer reviews to clarify why the agency's approval decisions in some cases differ from peer reviewers' written comments. Education also provided technical comments that we incorporated into the report as appropriate. We are sending copies of this report to appropriate congressional committees, the Secretary of Education, and other interested parties. In addition, the report will be available at no charge on GAO's Web site at [hyperlink, http://www.gao.gov]. Please contact me at (202) 512-7215 if you or your staff have any questions about this report. Contact points for our Offices of Congressional Relations and Public Affairs may be found on the last page of this report. Other major contributors to this report are listed in appendix VIII. Sincerely yours, Signed by: Cornelia M. Ashby: Director, Education, Workforce, and Income Security Issues: [End of section] Appendix I: Objectives, Scope, and Methodology: The objectives of this study were to answer the following questions: (1) How have state expenditures on assessments required by the Elementary and Secondary Education Act of 1965 (ESEA) changed since the No Child Left Behind Act of 2001 (NCLBA) was enacted in 2002, and how have states spent funds? (2) What factors have states considered in making decisions about question (item) type and content of their ESEA assessments? (3) What challenges, if any, have states faced in ensuring the validity and reliability of their ESEA assessments? (4) To what extent has the U.S. Department of Education (Education) supported and overseen state efforts to comply with ESEA assessment requirements? 
To meet these objectives, we used a variety of methods, including document reviews of Education and state documents, a Web-based survey of the 50 states and the District of Columbia, interviews with Education officials and assessment experts, site visits in four states, and a review of the relevant federal laws and regulations. The survey we used was reviewed by several external reviewers, and we incorporated their comments as appropriate. We conducted this performance audit from August 2008 through September 2009 in accordance with generally accepted government auditing standards. Those standards require that we plan and perform the audit to obtain sufficient, appropriate evidence to provide a reasonable basis for our findings and conclusions based on our audit objectives. We believe that the evidence obtained provides a reasonable basis for our findings and conclusions based on our audit objectives. Providing Information on How State Expenditures on Assessments Have Changed Since the Enactment of NCLBA and How States Have Spent Funds: To learn how state expenditures for ESEA assessments have changed since NCLBA was enacted in 2002 and how states spent these funds, we analyzed responses to our state survey, which was administered to state assessment directors in January 2009. In the survey, we asked states to provide information about the percentage of their funding from federal and state sources, their use of contractors, and the cost and availability of human resources, and to rank the cost of assessment activities. The survey used self-administered, electronic questionnaires that were posted on the Internet. We received responses from 49 states,[Footnote 31] for a 96 percent response rate. We did not receive responses from New York and Rhode Island. We reviewed state responses and followed up by telephone and e-mail with states for additional clarification and obtained corrected information for our final survey analysis. Nonresponse is one type of nonsampling error that could affect data quality. Other types of nonsampling error include variations in how respondents interpret questions, respondents' willingness to offer accurate responses, and data collection and processing errors. We included steps in developing the survey and in collecting, editing, and analyzing survey data to minimize such nonsampling error. In developing the Web survey, we pretested draft versions of the instrument with state officials and assessment experts in various states to check the clarity of the questions and the flow and layout of the survey. On the basis of the pretests, we made slight to moderate revisions to the survey. Using a Web-based survey also helped reduce error in our data collection effort. By allowing state assessment directors to enter their responses directly into an electronic instrument, this method automatically created a record for each assessment director in a data file and eliminated the need for and the errors (and costs) associated with a manual data entry process. In addition, the program used to analyze the survey data was independently verified to ensure the accuracy of this work. We also conducted site visits to four states--Maryland, Rhode Island, South Dakota, and Texas--that reflect a range of population size and results on Education's assessment peer review. On these site visits we interviewed state officials, officials from two districts in each state--selected in consultation with state officials to cover heavily- and sparsely-populated areas--and technical advisors to each state. 
Identifying Factors That States Have Considered in Making Decisions about Item Type and Content of Their Assessments: To gather information about factors states consider when making decisions about the item type and content of their assessments, we analyzed survey results. We asked states to provide information about their use of item types, including the types of items they use for each of their assessments (e.g., general, alternate, modified achievement standards, or native language), changes in their relative use of multiple choice and open/constructed response items, and factors influencing their decisions on which item types to use for reading/language arts and mathematics general assessments. We interviewed selected state officials and state technical advisors. We also interviewed officials from other states that had policies that helped address the challenge of including cognitively complex content in state assessments. We interviewed four major assessment vendors to obtain a broad perspective on the views of the assessment industry. Vendors were selected in consultation with the Association of American Publishers because its members include the major assessment vendors states have contracted with for ESEA assessment work. We reviewed studies that our site visit states submitted as evidence for Education's peer review approval process to document whether assessments are aligned with academic content standards, including the level of cognitive complexity in standards and assessments. We also spoke with representatives from three alignment organizations that states most frequently hire to conduct this type of study, and representatives of a fourth alignment organization that was used by one of our site visit states, who provided a national perspective on the cognitive complexity of assessment content. In addition, using GAO's data reliability tests, we reviewed selected academic research studies that examined the relationship between assessments and classroom curricula. We determined that the results of these research studies were sufficiently valid and reliable for the purposes of our work. Describing Challenges, If Any, That States Have Faced in Ensuring the Validity and Reliability of Their ESEA Assessments: To gather information about challenges states have faced in ensuring validity and reliability, we used our survey to collect information about state capacity and technical quality issues associated with assessments. We conducted reviews of state documents, such as assessment security protocols, and interviewed state officials. We asked state officials from the states we visited to complete a CCSSO checklist on student assessment--the Quality Control Checklist for Processing, Scoring, and Reporting--to show which steps they took to ensure quality control in high-stakes assessment programs. We used this specific document created by CCSSO because, as an association of public education officials, the organization provides considerable technical assistance to states on assessment. We confirmed with CCSSO that the document is still valid for state assessment programs and has not been updated. We also interviewed four assessment vendors and assessment security experts who were selected based on the extent of their involvement in statewide assessments. 
We also reviewed summaries of the peer review issues for states that have not yet been approved through the peer review process, the portion of peer review protocols that address assessment security, and the assessment security documents used to obtain approval in our four site visit states. Describing the Extent to Which Education Has Supported State Efforts to Comply with ESEA Assessment Requirements: To address the extent of Education's support of ESEA assessment implementation, we reviewed Education guidance, summaries of Education assistance, peer review training documents, and previous GAO work on peer review processes. In addition, we analyzed survey results. We asked states to provide information on the federal role in state assessments, including their perspectives on technical assistance offered by Education and Education's peer review process. We also asked peer reviewers to provide their perspectives on Education's peer review process. Of the 76 peer reviewers Education provided us, we randomly sampled 20 and sent them a short questionnaire asking about their perspectives on the peer review process. We obtained responses from nine peer reviewers. In addition, we interviewed Education officials in charge of the peer review and assistance efforts. [End of section] Appendix II: Student Population Assessed on ESEA Assessments in School Year 2007-08: General Reading/Language Arts Assessment; Approximate number of students assessed: 25 million in each of reading/language arts and mathematics in 49 states reporting. Alternate Reading/Language Arts Assessment Using Alternate Achievement Standards; Approximate number of students assessed: 250,000 in each of reading/language arts and mathematics in 48 states reporting. Alternate Reading/Language Arts Assessment Using Modified Achievement Standards; Approximate number of students assessed: 200,000 in each of reading/language arts and mathematics in 46 states reporting. Source: GAO. [End of section] Appendix III: Validity Requirements for Education's Peer Review: Education's guidance describes the evidence states needed to provide during the peer review process. These are: 1. Evidence based on test content (content validity). Content validity is the alignment of the standards and the assessment. 2. Evidence of the assessment's relationship with other variables. This means documenting the validity of an assessment by confirming its positive relationship with other assessments or evidence that is known or assumed to be valid. For example, if students who do well on the assessment in question also do well on some trusted assessment or rating, such as teachers' judgments, it might be said to be valid. It is also useful to gather evidence about what a test does not measure. For example, a test of mathematical reasoning should be more highly correlated with another math test, or perhaps with grades in math, than with a test of scientific reasoning or a reading comprehension test. 3. Evidence based on student response processes. The best opportunity for detecting and eliminating sources of test invalidity occurs during the test development process. Items need to be reviewed for ambiguity, irrelevant clues, and inaccuracy. More direct evidence bearing on the meaning of the scores can be gathered during the development process by asking students to "think-aloud" and describe the processes they "think" they are using as they struggle with the task. 
Many states now use this "assessment lab" approach to validating and refining assessment items and tasks. 4. Evidence based on internal structure. A variety of statistical techniques have been developed to study the structure of a test. These are used to study both the validity and the reliability of an assessment. The well-known technique of item analysis used during test development is actually a measure of how well a given item correlates with the other items on the test. A combination of several statistical techniques can help to ensure a balanced assessment, avoiding, on the one hand, the assessment of a narrow range of knowledge and skills but one that shows very high reliability, and on the other hand, the assessment of a very wide range of content and skills, triggering a decrease in the consistency of the results. In validating an assessment, the state must also consider the consequences of its interpretation and use. States must attend not only to the intended effects, but also to unintended effects. The disproportional placement of certain categories of students in special education as a result of accountability considerations rather than appropriate diagnosis is an example of an unintended--and negative-- consequence of what had been considered proper use of instruments that were considered valid. [End of table] Source: NCLB Standards and Assessments Peer Review Guidance. [End of section] Appendix IV: Reliability Requirements for Education's Peer Review: The traditional methods of portraying the consistency of test results, including reliability coefficients and standard errors of measurement, should be augmented by techniques that more accurately and visibly portray the actual level of accuracy. Most of these methods focus on error in terms of the probability that a student with a given score, or pattern of scores, is properly classified at a given performance level, such as "proficient." For school-level or district-level results, the report should indicate the estimated amount of error associated with the percent of students classified at each achievement level. For example, if a school reported that 47 percent of its students were proficient, the report might say that the reader could be confident at the 95 percent level that the school's true percent of students at the proficient level is between 33 percent and 61 percent. Furthermore, since the focus on results in a Title I context is on improvement over time, the report should also indicate the accuracy of the year-to-year changes in scores. Source: NCLB Standards and Assessments Peer Review Guidance. [End of section] Appendix V: Alignment Requirements for Education's Peer Review: To ensure that its standards and assessments are aligned, states need to consider whether the assessments: * Cover the full range of content specified in the state's academic content standards, meaning that all of the standards are represented legitimately in the assessments. * Measure both the content (what students know) and the process (what students can do) aspects of the academic content standards. * Reflect the same degree and pattern of emphasis apparent in the academic content standards (e.g., if the academic content standards place a lot of emphasis on operations, then so too should the assessments). 
* Reflect the full range of cognitive complexity and level of difficulty of the concepts and processes described, and depth represented, in the state's academic content standards, meaning that the assessments are as demanding as the standards. * Yield results that represent all achievement levels specified in the state's academic achievement standards. Source: NCLB Standards and Assessments Peer Review Guidance. [End of section] Appendix VI: Item Types Used Most Frequently by States on General and Alternate Assessments: [Refer to PDF for image: series of horizontal bar graphs] Subject studies: General reading/language arts: Multiple choice: Number of survey respondents: Number of states that use this item type: 48; Number of states that responded to the question: 48; Number of states that did not respond or checked “no response”: 1. Open/constructed response: Number of survey respondents: Number of states that use this item type: 38; Number of states that responded to the question: 45; Number of states that did not respond or checked “no response”: 4. Work samples/portfolio: Number of survey respondents: Number of states that use this item type: 2; Number of states that responded to the question: 41; Number of states that did not respond or checked “no response”: 8. Subject studies: General math: Multiple choice: Number of survey respondents: Number of states that use this item type: 48; Number of states that responded to the question: 48; Number of states that did not respond or checked “no response”: 1. Open/constructed response: Number of survey respondents: Number of states that use this item type: 34; Number of states that responded to the question: 45; Number of states that did not respond or checked “no response”: 4. Other format[A]; Number of survey respondents: Number of states that use this item type: 5; Number of states that responded to the question: 38; Number of states that did not respond or checked “no response”: 11. Subject studies: Alternate assessment using alternate achievement standards reading/language arts: Multiple choice: Number of survey respondents: Number of states that use this item type: 9; Number of states that responded to the question: 40; Number of states that did not respond or checked “no response”: 9. Rating scales: Number of survey respondents: Number of states that use this item type: 13; Number of states that responded to the question: 38; Number of states that did not respond or checked “no response”: 11. Work samples/portfolio: Number of survey respondents: Number of states that use this item type: 26; Number of states that responded to the question: 43; Number of states that did not respond or checked “no response”: 6. Subject studies: Alternate assessment using alternate achievement standards math: Multiple choice: Number of survey respondents: Number of states that use this item type: 9; Number of states that responded to the question: 40; Number of states that did not respond or checked “no response”: 9. Rating scales: Number of survey respondents: Number of states that use this item type: 13; Number of states that responded to the question: 38; Number of states that did not respond or checked “no response”: 11. Work samples/portfolio: Number of survey respondents: Number of states that use this item type: 26; Number of states that responded to the question: 43; Number of states that did not respond or checked “no response”: 6. Source: GAO survey. 
[A] Other format includes gridded response, performance event, scaffolded multiple choice and performance events, and locally developed formats. [End of figure] [End of section] Appendix VII: Comments from the U.S. Department of Education: United States Department Of Education: Office Of Elementary And Secondary Education: The Assistant Secretary: 404 Maryland Ave., S.W. WASHINGTON, DC 20202: [hyperlink, http://www.ed.gov] "Our mission is to ensure equal access to education and to promote educational excellence throughout the nation." September 8, 2009: Ms. Cornelia M. Ashby: Director: Education, Workforce, and Income Security Issues: U.S. Government Accountability Office: 441 G Street, NW: Washington, DC 20548: Dear Ms. Ashby: I am writing in response to your request for comments on the draft Government Accountability Office (GAO) report, "No Child Left Behind Act: Enhancements in the Department of Education's Review Process Could Improve State Academic Assessments" (GAO-09-911). This report has three recommendations for the Secretary of Education. Following is the Department's response. Recommendation: Incorporate test security best practices into the peer review protocols. Response: The Department recognizes the value of this recommendation and the importance of test security practices in maintaining the validity and reliability of each State's assessment system. Currently, as part of the peer review process, States do provide us with evidence of State statutes, rules of professional conduct, administrative manuals, and memoranda that address test security and reporting of test irregularities. Other procedures and requirements (e.g., remedies for teacher misconduct) are typically included in contractual agreements with test publishers and other parties, or collective bargaining agreements. The Department does not examine those additional provisions because we believe that our current practices are sufficient to ensure that appropriate test security policies and procedures are promulgated and implemented at the State level. Details on these additional provisions such as security methods and techniques to discover testing irregularities are best handled locally based on consideration of risk and cost factors. As the report mentions, the Department monitors the implementation of State test security policies in its regularly scheduled Title I monitoring visits to State and local educational agencies. Department staff will continue to monitor test security practices during the monitoring visits, issue findings to States with weak or incomplete test security practices, and require corrective action by States with monitoring findings. Recommendation: Develop methods to improve communication during the review process. Response: The Department has made the following improvements over the last year to improve communications with States during the peer review process. First, peers and the Department staff member assigned to review the State's assessment system typically call State assessment officials and discuss the submission and the peers' concerns. This occurs prior to the conclusion of the peer review, giving peers time to correct any misconceptions before they complete their review. Through this process, State officials have opportunities to ask questions and obtain clarification regarding the peers' and Department's concerns. 
Second, during the Technical Assistance Peer Review (May 2008), State assessment professionals (individuals or teams) met directly with the peers or peer team leader and the Department staff member assigned to the State to thoroughly discuss the peers' comments and concerns. A technical assistance review is conducted to help States understand where further development is required before the system is ready for review. The Department will continue this process. Furthermore, the Department is looking into the possibility of using a secure server as a means for State officials to submit questions, documents, and other evidence that would only be viewed by the reviewers, State officials, and Department staff. We believe that the use of a secure server, in combination with the procedures already in place, would strengthen the communication that takes place during the peer review process.

Recommendation: Identify for States why its peer review decisions in some cases differed from peer reviewers' written comments.

Response: Peer notes sometimes address areas outside of the Department's purview, offer recommendations to improve elements of the system beyond the requirements of the law and regulations, or offer opinions on technical matters. We do not use those recommendations in judging the merits of the assessment system, but, as a professional courtesy, we include them as technical assistance in the peer notes provided to the States. Peer notes, and the deliberations they document, are recommendations to the Assistant Secretary for Elementary and Secondary Education, and on occasion, Department staff may disagree with the peers' summary comments. The Assistant Secretary is presented with these discrepancies after they have been discussed internally among Department staff. These discrepancies usually deal with limits on the range of evidence that is required to be provided to demonstrate compliance with the applicable statutory and regulatory provisions and the extent to which the Department has authority in judging the quality of certain features of a State assessment system. For example, the Department has no prerogative to deny approval of an assessment system based on the substance of content standards, nor is the State required to submit evidence on that issue. The Department and peers review only the process used to develop a State's content standards, ensure broad participation of stakeholders in the process, and ensure that a State demonstrates the rigor of the standards. Hence, there are no peer-review "decisions," only peer recommendations reflecting the professional experience and perspectives of the reviewers. The Assistant Secretary takes these recommendations under consideration, along with those of Department staff, in making a decision regarding the approval of a State's assessment system. However, in response to this recommendation, Department staff will conduct a conference call in advance of upcoming peer reviews to clarify why the Department's decisions in some cases differ from peer reviewers' written comments.

I appreciate the opportunity to share our comments on the draft report. I hope that these comments are useful to you. In addition, we have provided some suggested technical edits that should be considered to add clarity to the report.

Sincerely,

Signed by:

Thelma Melendez de Santa Ana, Ph.D.

[End of section]

Appendix VIII: GAO Contact and Staff Acknowledgments:

GAO Contact:
Cornelia M. Ashby, (202) 512-7215 or ashbyc@gao.gov:

Staff Acknowledgments:

Bryon Gordon, Assistant Director, and Scott Spicer, Analyst-in-Charge, managed this assignment and made significant contributions to all aspects of this report. Jaime Allentuck, Karen Brown, and Alysia Darjean also made significant contributions. Additionally, Carolyn Boyce, Doreen Feldman, Cynthia Grant, Sheila R. McCoy, Luann Moy, and Charlie Willson aided in this assignment.

[End of section]

Footnotes:

[1] For purposes of this report, the term "ESEA assessments" refers to assessments currently required under ESEA, as amended. The Improving America's Schools Act of 1994 created some requirements for assessments, and these requirements were later supplemented by the requirements in NCLBA.

[2] For purposes of this report, we refer to test questions as "items." The term "item" can include multiple choice, open/constructed response, and various other types, while the term "question" connotes the usage of a question mark.

[3] New York and Rhode Island did not respond to the survey. For the purposes of this report, we refer to the District of Columbia as a state.

[4] Pub. L. No. 89-10.

[5] Pub. L. No. 103-382.

[6] Pub. L. No. 107-110.

[7] GAO, Title I: Characteristics of Tests Will Influence Expenses; Information Sharing May Help States Realize Efficiencies, GAO-03-389 (Washington, D.C.: May 8, 2003).

[8] Pub. L. No. 111-5.

[9] Adequate Yearly Progress (AYP) is a measure of year-to-year student achievement under ESEA. AYP is used to make determinations about whether or not schools or school districts have met state academic proficiency targets. All schools and districts are expected to reach 100 percent proficiency by the 2013-14 school year.

[10] For the total number of students tested on each of the different types of assessment in 2007-08, see appendix II.

[11] The 2 percent of scores included in AYP using the alternate assessment based on modified academic achievement standards is in addition to the 1 percent of the student population included with the alternate assessment based on alternate academic achievement standards.

[12] LEP students may only take assessments in their native language for a limited number of years.

[13] GAO's 2003 report (GAO-03-389) found that item type has a major influence on overall state expenditures for assessments. However, regarding the changes to state expenditures for assessments since the enactment of NCLBA--which our survey examined--few states reported that item type was a major factor.

[14] We asked states to rank the cost of test/item development, scoring, administration, reporting test results, data management, and all other assessment activities.

[15] Although GAO-03-389 found that item type was a key factor in determining the overall cost of state ESEA assessments, these differences were related to the cost of scoring assessments rather than developing assessments. Our research did not find that item type affected the cost of development.

[16] We defined small states as those states administering 500,000 or fewer ESEA assessments in 2007-08. Reading/language arts and mathematics assessments were counted separately.

[17] GAO, Title I: Characteristics of Tests Will Influence Expenses; Information Sharing May Help States Realize Efficiencies, [hyperlink, http://www.gao.gov/products/GAO-03-389] (Washington, D.C.: May 8, 2003).

[18] This does not necessarily indicate that state assessments were not aligned to state standards.
For example, if the content in standards does not include the highest cognitive level, assessments that do not address the highest cognitive level could be aligned to standards.

[19] The alignment review was conducted by Achieve, Inc., which was one of the four alignment organizations that we interviewed.

[20] Committee on the Foundations of Assessment, James W. Pellegrino, Naomi Chudowsky, and Robert Glaser, editors, Knowing What Students Know: The Science and Design of Educational Assessment (Washington, D.C.: National Academy Press, 2001), 194.

[21] Gail P. Baxter and Robert Glaser, "Investigating the Cognitive Complexity of Science Assessments," Educational Measurement: Issues and Practice, vol. 17, no. 3 (1998).

[22] Helen S. Apthorp, et al., "Standards in Classroom Practice Research Synthesis," Mid-Continent Research for Education and Learning (October 2001).

[23] For example, see 20 U.S.C. § 7907(c)(1) and 20 U.S.C. § 6575.

[24] Maryland did not complete this checklist.

[25] CCSSO is an association of public officials who head departments of elementary and secondary education in the states, the District of Columbia, the Department of Defense Education Activity, and five extra-state jurisdictions. It provides advocacy and technical assistance to its members. The CCSSO checklist describes 33 steps that state officials should take to ensure quality control in assessment programs that are used to make decisions with consequences for students or schools. The checklist can be found at [hyperlink, http://www.ccsso.org].

[26] U.S. Department of Education, Office of the Inspector General, Tennessee Department of Education Controls Over State Assessment Scoring, ED-OIG/A02I0034 (New York, N.Y.: May 2009).

[27] The 12 states that had not received full approval were California, the District of Columbia, Florida, Hawaii, Michigan, Mississippi, Nebraska, Nevada, New Hampshire, New Jersey, Vermont, and Wyoming. In all of these states except California, the alternate assessments based on alternate achievement standards were a factor preventing full approval.

[28] The four states were Florida, New Hampshire, New Jersey, and Vermont.

[29] Brian A. Jacob and Steven D. Levitt, "Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating," The Quarterly Journal of Economics (August 2003).

[30] American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, Standards for Educational and Psychological Testing (1999).

[31] In this report, we refer to the District of Columbia as a state.

[End of section]

GAO's Mission:

The Government Accountability Office, the audit, evaluation, and investigative arm of Congress, exists to support Congress in meeting its constitutional responsibilities and to help improve the performance and accountability of the federal government for the American people. GAO examines the use of public funds; evaluates federal programs and policies; and provides analyses, recommendations, and other assistance to help Congress make informed oversight, policy, and funding decisions. GAO's commitment to good government is reflected in its core values of accountability, integrity, and reliability.

Obtaining Copies of GAO Reports and Testimony:

The fastest and easiest way to obtain copies of GAO documents at no cost is through GAO's Web site, [hyperlink, http://www.gao.gov]. Each weekday, GAO posts newly released reports, testimony, and correspondence on its Web site.
To have GAO e-mail you a list of newly posted products every afternoon, go to [hyperlink, http://www.gao.gov] and select "E-mail Updates."

Order by Phone:

The price of each GAO publication reflects GAO's actual cost of production and distribution and depends on the number of pages in the publication and whether the publication is printed in color or black and white. Pricing and ordering information is posted on GAO's Web site, [hyperlink, http://www.gao.gov/ordering.htm]. Place orders by calling (202) 512-6000, toll free (866) 801-7077, or TDD (202) 512-2537. Orders may be paid for using American Express, Discover Card, MasterCard, Visa, check, or money order. Call for additional information.

To Report Fraud, Waste, and Abuse in Federal Programs:

Contact:
Web site: [hyperlink, http://www.gao.gov/fraudnet/fraudnet.htm]:
E-mail: fraudnet@gao.gov:
Automated answering system: (800) 424-5454 or (202) 512-7470:

Congressional Relations:

Ralph Dawn, Managing Director, dawnr@gao.gov:
(202) 512-4400:
U.S. Government Accountability Office:
441 G Street NW, Room 7125:
Washington, D.C. 20548:

Public Affairs:

Chuck Young, Managing Director, youngc1@gao.gov:
(202) 512-4800:
U.S. Government Accountability Office:
441 G Street NW, Room 7149:
Washington, D.C. 20548: