Archives for the month of: March, 2013

A teacher writes:

I went to a meeting today and had my eyes opened – WIDE. As a teacher at a rural school, we are a little behind on this VAM thing. Needless to say, at district-wide meetings I get to meet teachers from all over, some from much larger suburban districts. They already have their pay based on students’ and the school’s improvement on tests.
WELL, the teachers have figured the whole student improvement thing out – DISTRACT THE LIVING DAYLIGHTS OUT OF THE STUDENTS DURING THE BEGINNING-OF-THE-YEAR TEST. Yep, that is exactly what many of them are doing. Play music, talk on the phone, talk to other teachers very loudly, clean the room… do whatever you can to lower your students’ scores at the beginning, and then have a silent, well-ordered room, with hints everywhere, at the end of the year. Success. The test isn’t fair, so why should teachers have to play fair?

This is one of Gary Rubinstein’s most powerful posts.

He analyzes a series of TFA videos that are shameless propaganda for the view that high expectations overcome poverty and that TFA has cracked the secret code of education.

This sort of rhetoric reveals the basic sin of TFA. The organization encourages policymakers to believe that they don’t need to do anything to reduce poverty. Maybe that’s why the corporate giants and rightwing foundations such as Walton love TFA.

No new taxes, just higher expectations from teachers with five weeks of training.

EduShyster has developed a list of 10 signs of a real, true Transphormer. You know, the ones who are so motivated to arrange the lives of other people’s children that they can’t wait to get their parents’ okay. The ones who are so gripped by a sense of urgency that they feel called to close schools in poor communities and fire the staff without a moment’s delay, even though the students and parents beg them not to do it.

A reader sends this information about the New England assessment:

In terms of cut scores, this data from New Hampshire puts things in perspective:

2010-2011 graduation rate: 86% (3rd in US)
2011 Science and Engineering Readiness Index (SERI) ranking: 4th in US

2011 NAEP Mathematics:
4th grade scale score state rank: 2nd
4th grade % of students below basic: 8% (2nd)
8th grade scale score state rank: 6th
8th grade % of students below basic: 18% (4th)

2009 NAEP 11th Grade Mathematics Pilot (of 11 participating states)
Scale score state rank: 2nd of 11
% of students scoring below basic: 26% (3rd)

2012 NECAP Mathematics
4th grade % of students substantially below proficient: 8%
8th grade % of students substantially below proficient: 15%
11th grade % of students substantially below proficient: 36%

Percentage of current New Hampshire high school juniors who would be at risk of not graduating under Rhode Island’s 2014 requirements: 36%

Teachers at a charter school in Louisiana received eye-popping bonuses.

One got a bonus of $43,000–more than 75% of her annual salary–for raising test scores by 88% in one year.

Five teachers shared bonuses of $167,000.

The money comes from a federal grant.

One teacher saw a gain of nearly 200%, but she teaches kindergarten, so she received only $4,086.

The school got a grade of D from the state. Last year, it was D-.

The scores, the grades, the gains, the bonuses. Are the children better educated? Who knows?

In other districts, gains of this size usually are grounds for an investigation. But this is Louisiana, so forget about it.

As we saw in the previous post, Tom Sgouros explained in detail why it was wrong for Rhode Island to use the NECAP as a graduation requirement. The test was not designed for that purpose, and many students who should pass will fail.

State Commissioner of Education Deborah Gist said Sgouros was wrong because he is not a psychometrician. She did not explain why he was wrong, nor did she seem to understand that psychometricians would likely agree with Sgouros. The cardinal rule of testing is that tests should be used only for the purpose for which they were designed.

Here is Sgouros’ account (if I hear from Gist, I will print hers):

Gist Offers Logical Fallacies On NECAP Value

By Tom Sgouros on March 20, 2013

I was on the radio ever so briefly this afternoon, on Buddy Cianci’s show with Deborah Gist. Unfortunately, the show’s producer hadn’t actually invited me so I had no idea until it had been underway for an hour. I gather they had a lively conversation that involved belittling the concerns about the NECAP test that I expressed here.

While I was on hold, I had to get on a bus in order not to leave my daughter waiting for me in the snow. Then Buddy said the bus was too loud but he’d invite me back on. So I was only on for about five minutes, long enough to hear Gist say I may be good at math, but I’m no psychometrician.

Guilty as charged, but somewhat beside the point.

I’ve heard the commissioner speak in public in a few different ways since I published my letter last week. She tweeted about it a couple of times last week and over the weekend. She was quoted in the paper this morning about how it was an “outrageous act of irresponsibility” for adults to take the NECAP 11th grade math test at the Providence Student Union event on Saturday. And today she spent a while on the WPRO airwaves insulting me.

But I have yet to hear any of the points I’ve made taken on directly.

Only what is called the argument from authority: I’m education commissioner and you’re not. Or in this case: I’m education commissioner, and you’re not a psychometrician.

As a style of public argument, this is highly effective, especially if salted with a pinch of condescension. It typically has the effect of shutting down debate right there because after all, who are you to question authority so?

The problem is that if you believe, as I do, that policy actually matters, this is a dangerous course to take.

After all, the real point of any policy discussion is not scoring debate points, but finding solutions to the problems that beset us. This is a highly imperfect world we live in, filled with awful problems, some of which we can only address collectively. If you don’t get the policy right, here’s what happens: the problems don’t get solved. Frequently, bad policy makes the problems worse, no matter how many debate points you scored, or how effectively you shut up your opponent.

So, do I care that Deborah Gist thinks I’m an inadequate excuse for a psychometrician? It turns out that, upon deep and lingering introspection, I can say with confidence that I do not. But I do care about the state of math education in Rhode Island, and I believe she has us on a course that will only damage the goal she claims to share with me.

Now I may be wrong about my NECAP concerns, but nothing I’ve learned in the past week has made me less confident in my assessment. On the one hand, I’ve seen vigorous denunciations of the PSU efforts, and mine, none of which have actually addressed the points I’ve raised. These are specific points, easily addressed. On the flip side, I’ve quietly heard from current and former RIDE employees that my concerns are theirs, but the policy is or was not in their hands.

Those points again: there are a few different ways to design a test. You can make a test to determine whether a student has mastered a body of knowledge; you can make a test to rank students against each other; you can make a test to rank students against each other referenced to a particular body of knowledge. I imagine there are lots of other ways to think about testing, but those are the ones in wide use. The first is a subject-matter test, like the French Baccalaureate or the New York State Regents exams. The second is a norm-referenced test like the SAT or GRE, where there are no absolute scores and all students are simply graded against each other on a fairly abstract standard. NECAP is in a third category, where it ranks students, but against a more concrete standard. The Massachusetts MCAS is pretty much the same deal, though it seems to range more widely over subject matter.

The problem comes when you imagine that these are pretty much interchangeable. After all, they all have questions, they all make students sweat, and they all require a number two pencil. How different could they be?

Answer: pretty different. If your goal is ranking students, you choose questions that separate one student from another. You design the test so that the resulting distribution of test scores is wide, which is another way to say that lots of students will flunk such a test. If your goal is assessing whether students have mastered a body of knowledge, the test designer won’t care nearly so much about the resulting distribution of scores, only that the knowledge tested be representative of the field. (The teacher will care about the distribution, of course, since it’s a measure of how well the subject has been taught.) The rest was explained in my post last week.
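A quick way to see this is to simulate it. The following Python sketch uses entirely invented numbers (it is not the NECAP item-selection procedure): the same pool of simulated students takes one test built from easy “mastery” items and one built from mid-difficulty “ranking” items, and the second test spreads scores out so widely that a fixed cut score fails far more of them.

import numpy as np

rng = np.random.default_rng(42)
n_students, n_items = 5000, 50
ability = rng.normal(0, 1, n_students)          # invented student "ability" scale

def simulate(item_difficulty):
    # Chance of a correct answer rises with ability relative to item difficulty.
    p_correct = 1 / (1 + np.exp(-(ability[:, None] - item_difficulty)))
    answers = rng.random((n_students, n_items)) < p_correct
    return 100 * answers.mean(axis=1)           # percent-correct score per student

mastery = simulate(item_difficulty=-2.0)        # easy items: most students get them
ranking = simulate(item_difficulty=0.0)         # mid-difficulty items: scores fan out

for name, scores in [("mastery", mastery), ("ranking", ranking)]:
    print(f"{name:8s}: mean {scores.mean():5.1f}, spread (sd) {scores.std():4.1f}, "
          f"below a 70% cut: {100 * np.mean(scores < 70):4.1f}%")

The exact percentages depend on the made-up parameters; the point is only that the choice of items, not the students, largely determines how wide the score distribution is and how many students fall below any given cut.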

The real question is, if you don’t know what the NECAP is measuring, why exactly might you think that it’s a good thing to rely on it so heavily as a graduation requirement?

Deborah Gist is hardly the first person to call me wrong about something. That happens all the time, as it does for anybody who writes for the public about policy. But like so many others who claim I am wrong, she refuses to say — or cannot say — why.

Tom Sgouros, an engineer, wrote an open letter to the chair of the Rhode Island Board of Education, explaining succinctly why the NECAP should not be used as a graduation test. It was not designed for that purpose, and it will fail students who deserve to pass.

Anyone who reads and understands this letter will recognize that using the NECAP as a graduation test is educational malpractice. If officials persist in the face of clear evidence that it harms children, they should resign or be fired.

The open letter:

Open Letter About NECAP To Eva Mancuso

By Tom Sgouros on March 14, 2013

Eva Marie Mancuso, Chair,
Rhode Island Board of Education,
Rhode Island Department of Education,
225 Westminster Street,
Providence RI 02903

Dear Ms. Mancuso:

I read with interest in this morning’s news about the Providence School Board’s suggestion to the Board that you not rely on the NECAP test as a graduation requirement. I would like to second that suggestion, and offer some words of explanation that I believe have been largely absent from the debate until now.

The Providence board points out that the NECAP test was “not designed” to be a graduation requirement. That is quite true, but few go on to say why that makes it inappropriate to use as a performance threshold for graduating students.

First, a little about me. I have worked as a freelance engineer and policy analyst for 30 years, and both occupations have required me to acquire an expertise in statistics. I speak not as a statistical layman, but as an expert hoping to translate important concepts for people who may not have deep familiarity with p-values and confidence intervals. I do not wish to condescend, but I am afraid that some basic statistical concepts have not been well understood by policy makers in the past, and consequently decisions have been made that are deeply damaging to our students, and to education in Rhode Island generally.

The important point I wish the board members to understand is exactly what the difference is between a test like NECAP, designed to rank schools and students, and a test designed to evaluate student proficiency. The short version: when you design a test like NECAP, you ensure that a certain number of students will flunk. What’s more, for the test designers’ purposes, that’s a good thing.

Here’s the longer version. The original goal of NECAP was to evaluate schools, and, to some extent, students within the schools. In order to make a reliable ranking among schools, you need to ensure that the differences between one school and another (or one student and another) are statistically significant. This is simply how you ensure that the rankings are the result of real differences between schools, and not the result of chance.

A traditional test, such as the final exam a teacher might give to her class at the end of the term, will likely enough have a distribution of grades that looks something like the graph below. (I use a class size of 5000 here. This is obviously a lot of students for a single class, but only a fraction of the number who take the NECAP tests.)

Suppose the teacher sets the passing grade at 70; then about 4% of her students fail the class. That’s a shame, but it’s not unusual, and those students will have to take the class again or take the test again or whatever. If the goal is to see which of the students in the class have properly understood the material, this is a useful result.

But if the goal was to rank the students’ performance, this result won’t help much. A very large number of students scored between 80 and 84. In the graph, 1200 students, a quarter of the population, have almost the same score, and 6% of them have exactly the same score, 83. How can you rank them?

Furthermore, like any other measurement, a test score has an inherent error. For any individual student, a teacher can have little confidence that a student who scored an 80 didn’t deserve an 84 because of a bad day, a careless mistake, or, worse, someone else’s error: a misunderstood instruction, an incomplete erasure, or a grading mistake. Of course, any errors could also move the score in the other direction.

The problem is that moving a student’s score from 80 to 84 moves the student from the 18th percentile to the 38th, a huge jump. In other words, a test score might rank a student in the 18th percentile, but one can have no confidence that he or she didn’t belong in the 38th — or the 5th. Conversely, a student in the 92d percentile might really belong in the 69th or the 99th, depending on the same four-point error.

The designers of tests understand this, and so try to avoid ranking students based on the results of tests that give distributions like the above. Instead, they try to design tests so the distribution of scores looks more like the one here:

With a test that gives results like this, there are many fewer students in most of the score ranges here. Assuming the same level of error, you can be much more sure that a student who scored in some percentile belongs there, or nearby. With the same four-point error as above, you can be confident — in the statistical sense — that a student who scored in the 18th percentile on this test belongs somewhere in between the 14th and 22d percentiles, a much smaller range. A student in the 92d percentile belongs somewhere between the 89th and 95th percentile.
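For readers who want to check the arithmetic, here is a minimal Python sketch with made-up (normal) score distributions rather than actual NECAP data. It translates the same four-point measurement error into percentile terms twice: once on a bunched distribution and once on a deliberately spread-out one.

import numpy as np

rng = np.random.default_rng(0)
n = 5000                                            # hypothetical class size, as above

bunched = np.clip(rng.normal(82, 5, n), 0, 100)     # scores pile up in the low 80s
spread = np.clip(rng.normal(82, 15, n), 0, 100)     # same mean, deliberately fanned out

def percentile_of(scores, value):
    # Percent of students scoring at or below the given value.
    return 100.0 * np.mean(scores <= value)

for name, scores in [("bunched", bunched), ("spread", spread)]:
    lo, hi = percentile_of(scores, 80), percentile_of(scores, 84)
    print(f"{name:8s}: a score of 80 sits at the {lo:.0f}th percentile, a score of 84 "
          f"at the {hi:.0f}th -- a swing of {hi - lo:.0f} percentile points")

The specific percentiles differ from the figures above because the simulated distributions are invented, but the pattern is the same: the bunched distribution turns a small score error into a large percentile swing, while the spread-out one does not.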

In other words, if a test designer wants to rank students, or schools, he or she designs the test to spread the scores out. You don’t want scores to be bunched up. This is confirmed by details provided in the technical manuals that document the test design process. For example, section 5.1 of the NECAP 2011-2012 technical report (“Classical Difficulty and Discrimination Indices”) explains:

“Items that are answered correctly by almost all students provide little information about differences in student abilities, but do indicate knowledge or skills that have been mastered by most students. Similarly, items that are correctly answered by very few students provide little information about differences in student abilities, but may indicate knowledge or skills that have not yet been mastered by most students.”

This section goes on to discuss how the designers evaluate test items for their capacity to discriminate among students, and demonstrates that most of the questions used in the various NECAP tests do exactly that. In other words, very few of the questions are correctly answered by all students. In Appendix F of the 2011-12 manual, you can see some item-level analyses. There, one can read that, of the 22 questions analyzed on the 11th grade math test, none was correctly answered by more than 80% of students, and only nine were correctly answered by more than half.

Contrast this with the other kind of test design. In the first graph above, even the students who flunked the test would have answered around 60% of the questions correctly. The NECAP designers would deem those questions to “provide little information about differences in student abilities.” According to this theory of test design, such questions are a waste of time, except to the extent that they might be included to “ensure sufficient content coverage.” Put another way, if all the students in a grade answered all the questions properly, the NECAP designers would consider that test to be flawed and redesign it so that doesn’t happen. Much of the technical manual, especially chapters 5 and 6 (and most of the appendices), is devoted to demonstrating that the NECAP test is not flawed in this way. Again, the NECAP test is specifically designed to flunk a substantial proportion of students who take it, though this is admittedly a crude way to put it.
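To make the quoted passage concrete, here is a generic, textbook-style sketch of a classical item analysis in Python. It is not the NECAP contractor’s actual procedure, and the response data are fabricated from a simple ability model; it simply shows why items that nearly everyone gets right (or wrong) carry little discriminating information.

import numpy as np

rng = np.random.default_rng(1)
n_students = 1000
ability = rng.normal(0, 1, n_students)                     # invented ability scale
item_difficulty = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])    # very easy ... very hard

# Simulate right/wrong answers: chance of success rises with ability vs. difficulty.
p_correct = 1 / (1 + np.exp(-(ability[:, None] - item_difficulty)))
responses = (rng.random(p_correct.shape) < p_correct).astype(int)

total = responses.sum(axis=1)                              # each student's raw score

for item in range(responses.shape[1]):
    difficulty = responses[:, item].mean()                 # classical p-value of the item
    rest = total - responses[:, item]                      # score on the remaining items
    discrimination = np.corrcoef(responses[:, item], rest)[0, 1]
    print(f"item {item + 1}: difficulty {difficulty:.2f}, "
          f"discrimination {discrimination:+.2f}")

Run on data like this, the very easy and very hard items come out with the lowest discrimination values, which is the property the technical report uses to screen items.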

11th Grade Math

Before leaving the subject of students flunking the NECAP tests, it’s worth taking a moment to consider the 11th grade math test specifically. Once the test was designed, the designers convened panels of educators to determine the “cut scores” to be used to delineate “proficiency.” The process is described in appendices to the technical manual:

Standard setting:

Grades 5–8, in Appendix D, 2005-06 report
Grade 11, in Appendix F, 2007-08 report
Grades 5 & 8 Writing in Appendix M, 2010-11 report

After consulting these appendices, you will see that — at the time they were chosen — the cut scores for the 11th grade math test put 46.5% of all test takers in the “substantially below proficient” category (see page 19 of Appendix F 2007-08). This is almost four times as many students as were in that category for the 11th grade reading test and more than twice as many for any other NECAP test in the other grades.

There is no reason to think that the discussions among the panels that came up with these cut scores were not sincere, nor to think that the levels chosen were not appropriate. However, it is worth noting that the tests occur almost two years before a student’s graduation, and that math education proceeds in a fundamentally different way than reading. That is, anyone who can read at all can make a stab at reading material beyond their grade level, but you can’t solve a quadratic equation halfway.

Rather than providing a measure of student competence on graduation, the test might instead be providing a measurement of the pace of math education in the final two years of high school. The NECAP test designers would doubtless be able to design questions or testing protocols to differentiate between a good student who hasn’t hit the material yet and a poor student who shouldn’t graduate, but they were not tasked with doing that, and so did not.

Testing

To be quite clear, I am not an opponent of testing, nor even an opponent of high-stakes testing. The current testing regime has produced a backlash against testing in a general way, but this is a case where bad policy has produced bad politics. It’s hard to imagine running something as complex as a school department in the absence of some kind of indicator of how well one is running it. Since educated students are the output, it is crucial to the success of the overall enterprise that we find some way to measure progress in improving that level of education.

Similarly, high-stakes graduation tests are hardly anathema. Over the past half-century, the entire nation of France has done very well with a high-stakes test at high school graduation. Closer to home, the New York State Regents’ tests are a model that many other states would do well to copy. There is nothing wrong with “teaching to the test” when the test is part of a well-designed and interesting curriculum.

However, if evaluation of progress is the goal, and if you want an accurate measurement of how well a school is doing, there is a vast body of evidence available to say that high stakes testing won’t provide that. When there are severe professional consequences for teachers and school administrators whose classes and schools perform badly on tests, you guarantee that the tests will provide only a cloudy indication of a school’s progress. Teaching to the test is only one of the possible sins. School systems across the country have seen cheating scandals, as well as such interesting strategies as manipulating school lunch menus to improve test performance. In other words, raising the stakes of a test almost certainly makes the test a worse indicator of the very things it is supposed to measure.

Furthermore, a sensible evaluation regime would be minimally intrusive, and take only a small amount of time away from instruction. After all, testing time is time during which no instruction happens. But the imposition of high stakes has rendered that nearly impossible, so instead we have tests that disrupt several weeks of classes in most school districts, not to mention the disruption they cause to the curriculum.

Unfortunately for the students of Rhode Island, our state has tried to take the easy way out, and use a test designed for evaluation to serve many purposes. Today, the NECAP test affects the careers of students, teachers, and administrators. It is used in a high-stakes way which guarantees that it is an inaccurate indicator of the very things it is supposed to measure. It is used for purposes far beyond its original design, producing perfectly needless pain and heartbreak across the state.

Worst of all, none of this is news to education professionals. They know how to read technical manuals and to sort through statistical exegeses of test results. They know about the harm done to students by cutting electives to focus on improving reading results. They know about the other corners cut to try to improve test results at all costs. They know that we don’t abuse the NECAP test in order to help students. They know we did this strictly to save money.

I urge you and the new education board to reconsider the state’s use — and abuse — of the NECAP test. It could be a valuable tool with which to understand how to improve education in our state. Unfortunately, poor decisions made in the past have done much to undermine that value, to our state’s detriment, and that of all the students in our schools.

Yours sincerely,

Tom Sgouros

Tom Sgouros
Tom Sgouros is a freelance engineer, policy analyst, and writer. Reach him at ripr@whatcheer.net. Buy his book, “Ten Things You Don’t Know About Rhode Island” at whatcheer.net
5 responses to “Open Letter About NECAP To Eva Mancuso”

tom_hoffman, March 14, 2013 at 10:51 am:
Good work, Tom. I hadn’t dug into how the cut scores were set, but it is plainly obvious from the results across three states that the 11th grade math test is just way, way harder to pass than the rest of the tests.

Walt48, March 14, 2013 at 2:18 pm:
I am afraid. Why do I think that we now have at hand a Queen Ghidorah that has devoured Rhode Island (Raimondo, Gist, and Mancuso)?

Bill Daly, March 14, 2013 at 3:44 pm:
Tom, thank you for a well-reasoned analysis. School committees throughout the state should follow the example of the Providence School Committee and register their dissatisfaction with the statistically flawed NECAP test. As you point out, it was never designed by the test developers to serve as a graduation requirement.

leekhat, March 15, 2013 at 8:56 am:
A great letter. I did not know some of that information, but, as an educator who is experiencing some of the pain of state ed reform initiatives, I will be educating all who will listen about NECAP test design and its intended use. I am also relieved that the Providence School Committee is confronting this issue and hope that other school committees give it some thought.
Mark Williams
NKHS

Beleaguered Teacher, March 17, 2013 at 11:12 am:
Thank you, Mr. Sgouros, for a cogent and coherent explanation. It gives me hope to know that there are individuals out there who do, indeed, “get it.” May I respectfully suggest that you share it with the newspapers, in order to reach a wider audience?

A Fulton County parent sent me this notice of a meeting today: the second annual Faith Summit “to forge partnerships between schools and the faith community. The free event is for leaders of local houses of worship to join school principals and district leaders in a collaborative discussion on practical ways to provide resources benefiting both schools and houses of worship.”

The parent was disturbed by that. It seems to be part of a larger trend to eliminate the line between public and private, between church and state. We can respect all religions, don’t you think, without bringing religious ideas into the public schools.

I am speaking later today at UNC Charlotte, at 7 pm, at the
Cone McKnight Auditorium – UNCC campus
320 East Ninth St

Tomorrow in Raleigh. Details here.

The National Research Council is conducting a five-year review of mayoral control and the D.C. Public Schools.

The committee created for this purpose will meet on March 22.

There is an open session at 1 pm to discuss test security and the validity of test scores in D.C.

This is a good opportunity to listen, learn, and perhaps determine whether the researchers intend to conduct a probe that is more thorough than the cursory reviews of two inspectors general.

Here are the details:

Committee for the Five-Year (2009-2013) Summative Evaluation of the District of Columbia Public Schools

Meeting Two
March 22, 2013
500 Fifth St., NW
Washington DC
DRAFT Agenda

OPEN SESSION

1:00 – 3:00 pm

Discussion of Test Security and Validity of Achievement Test Data

• Overview of test security issues and session goals

Lorraine McDonnell and Carl Cohn

• Best practices for preventing security violations

Carswell Whitehead, ETS

• Statistical tools for flagging anomalies and issues they raise

Carswell Whitehead, ETS

• Strategies and issues in forensic investigation of possible security violations

Steve Ferrara, Pearson Assessments

• Case Study: What can DC learn from events in other districts?

Heather Vogell, Atlanta Journal-Constitution