Despite the glowing hyperbole in the media, Mercedes Schneider says there is nothing new in the results. The study is dated, there is missing data, the effects of the cheating scandal remain unknown, and the investigation of the cheating was turned over to an accounting firm with no experience in investigating cheating. Mercedes is not impressed.
Rhee is going to be on CNN tonight to discuss the results of this non-study. I think it is very important that all who read this blog get the word out that these new claims are complete bunk.
I can’t believe they would promote this junk on CNN.
Dee Dee:
Why do you say that it is a junk study? Have you actually looked at it? It is fair enough to point out its possible flaws but to call it a junk study requires some real evidence.
It seems to me that calling the paper obsolete is an exaggeration. The addition of a fifth category in the system and the reduction of the importance of test scores in teacher evaluations have no impact on the main conclusion that if you give teachers incentives to change their practices in a certain way, they will change their practices.
Both the categories and the classification system are changed.
The study cannot be extended. It is that simple.
“…if you give teachers incentives to change their practices in a certain way they will change their practices.”
This conclusion is too simplistic, especially given the fact that data quality is in question (the cheating was never resolved). How are teachers “changing their practices,” exactly? Ask Adell Cothorne her opinion on this.
Not all “teachers” are even affected by this so-called evaluation system. TFA escapes. Other teachers are affected; a major “change” that they make is to leave; that leaves room for more TFAers to enter – the temp teachers escape this system that is highly oppressive to career teachers.
As the authors of the paper said, the conclusions still hold if you eliminate the Group 1 teachers from the data pool (the Group 1 teachers are your mysterious “their”), and only Group 1 teachers’ evaluations included test scores.
I would not think the findings here would be very controversial. In other threads folks talk about how important it is to pay teachers more when they get an advanced degree because teachers will not get advanced degrees without the additional salary. This paper is presenting evidence that this is true for other aspects of teacher effort.
What do you mean by extended? Do you think the results of the study are so fragile that slight changes in the evaluation procedure will result in large changes in the behavior of teachers?
There are too many problems with this study to trust its results, much less apply them with any confidence. To say that the Dee and Wyckoff paper is proof that IMPACT is “working” is a major stretch. It was a stretch before IMPACT was altered, and it is much more of a stretch now that the system is changed.
The results are founded on questionable data – this is a major problem. The cheating issue was acknowledged but never resolved during the time of data collection.
The 2013-14 alterations to IMPACT are not “slight.” Removing a faulty school-wide VAM score and adding an extra “probationary” year will “impact” careers. It is easy to term changes “slight” when they do not affect one’s own livelihood.
How do you think these changes will change the findings of the paper?
I should take a step back. What do you think the results of the study are? We should agree on that before we move on to talk about how stable these results might be to the changes in the 2013-14 system.
Teachingec, I am not going to spend my time rehashing what I have written.
I was once both a reviewer and an associate editor of research for a flagship counseling journal. For four years, I made publication decisions on research studies. I would not have forwarded the paper for publication based upon one particular weakness I note in the conclusion of my blog post:
“Dee and Wyckoff have written a working paper on a version of IMPACT that is now obsolete. They used data that originated with both a former chancellor and a current chancellor who themselves have not been cleared of involvement in DC’s cheating scandal.”
As a reviewer, I would have sent the paper back asking the researchers to rewrite using information on the current state of IMPACT – provided they could show better proof of data integrity than that Caveon and Alvarez and Marsal were in charge of a highly questionable cheating investigation.
As they stand, the data integrity issues in the Dee and Wyckoff paper are fatal.
As far as continued commenting goes with you regarding this post, I am done.
Mercedes:
Have you shared your concerns with Dee and Wyckoff? If so, what were their responses?
It is unfortunate that you do not have time for this discussion. I suspect our differing views of the validity of the paper’s conclusions stem from different understandings of what the conclusions of the study are.
What would be a better operating word, then? When you subtly change the incentives (VAM no longer has as much weight + monetary rewards have significantly decreased = a change in the variables), the behavior/outcome is subtly changed as well. Clearly, IMPACT 1.0 is not equal to IMPACT 3.0.
The paper will/would never pass peer review.
The Morrigan:
As I read the paper, it is about whether or not changing incentives via a teacher evaluation process will in fact alter teacher behavior. The fact that the treatment has changed does not in itself alter the potential significance of their findings though it certainly limits any generalization of their findings to the first two years of any similarly designed teacher evaluation process.
FWIW, I have no doubts that this paper will be published in a significant peer reviewed journal. That said, it definitely needs replication and an extensive SI sufficient to allow for such replication by interested parties.
The criterion for peer review is not that the system that generated your data is still in place. NBER papers are generally of very high quality and I expect it will end up being published in a peer reviewed journal.
Postings like this are why I had hoped that poster deutsch29 would have had time to discuss what the conclusions of the paper are. As I read it, the authors of the paper are interested in seeing whether salary differences can create incentives for teachers to change classroom practices. The existing literature suggested that this would not be the case, but there were reasons to believe that the existing studies were too limited in scope (including the use of test scores as the only measure of changes in teachers’ classroom practices). The study found that teacher practices did change to more closely match the practices that were looked for in the evaluation, and that teachers who were evaluated poorly were more likely to leave teaching.
It is not clear to me why changes in the details of how the incentive structure is set up would mean that incentive structures suddenly have no impact on teacher practices.
TE:
I think we are in agreement. I read D&W as testing whether a teacher evaluation process with real, significant consequences, i.e., dismissal or major salary increases, has at least a short-term impact on teacher behavior. For the data set they examined and for the IMPACT assessment process, their evidence seems pretty compelling, though, as Mercedes says, it has potential flaws due to the cheating issue and the brevity of the period observed.
I have written to Dee asking two questions that struck me as I looked at the data in the report:
First, given the known improvement of novice teachers with increased experience, what is the effect of excluding such teachers from your sample?
Second, given the proclivity of all human beings to maximize rewards, what is the effect of excluding teachers who qualify for retirement from your sample?
The first variable is included in their data set. I am unclear whether the second variable is explicitly included, since the experience variables are in 5 bands from 0 to 19. Since these do not add up to 100%, 20-and-above years may be the null (omitted) variable and would therefore be 17% of the sample.
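To make that inference explicit: if the experience bands are exhaustive, whatever share they leave unaccounted for belongs to the omitted category. A toy calculation (the band shares below are illustrative, not the paper’s; only the logic is the point):

```python
# Illustrative experience-band shares; the actual shares should be read
# from the paper's descriptive table.
reported_bands = {"0-4": 0.30, "5-9": 0.22, "10-14": 0.17, "15-19": 0.14}

# If the bands are exhaustive and mutually exclusive, the omitted ("null")
# category, here 20+ years, is simply the remaining share.
omitted_share = 1.0 - sum(reported_bands.values())
print(f"Implied share of teachers with 20+ years: {omitted_share:.0%}")  # 17%
```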
I urge everybody to read the actual Dee and Wyckoff study:
Click to access 16_Dee-Impact.pdf
That is a good idea. Then go to the IMPACT website and compare to the modern version.
The current inclusion of a middle category, “Developing,” and the exclusion of the school-wide VAM score for all teachers is a tacit admission that IMPACT in its 2009-11 form isn’t “the” solution the media makes it out to be.
Mercedes:
You seem to be implying that more cheating was going on than has been publicly acknowledged. Is that correct?
If that is so, what do you make of the results for Group 2 – those teachers without test scores?
Bernie, no quality cheating investigation happened. Adell Cothorne came forward and was fired. Rhee set the stakes painfully high for all DC admin and teachers, so for me to believe that all cheating was dealt with is naive.
As to Group 2: All teachers in the older IMPACT version had VAM as part of their scores. In their case, the VAM was a schoolwide VAM component. Thus, the cheating could have tainted the entire data set.
However, the fact that cheating could have tainted even part of the data set is problematic.
Dee and Wyckoff performed tests to see if there was the possibility of “gaming the system” via higher classroom evals. They found no results to indicate as much. However, once again I return to my distrust of the data to begin with.
Mercedes:
Thanks for the response. D&W Table 1 indicates that the school-based VA measure was 5% for both Group 1 and 2. Is that your understanding? If so, then it plays a minimal role in their analysis of Group 2 teachers, surely?
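To make “minimal” concrete: with a 5% weight, even a full-point swing in the school-based VA component barely moves the composite. A toy calculation (the component scale is illustrative; only the 95/5 split comes from Table 1):

```python
def impact_composite(other_components: float, school_va: float) -> float:
    """Composite with the other measures at 95% weight and school VA at 5%."""
    return 0.95 * other_components + 0.05 * school_va

# Hold the other measures fixed and swing school VA by a full point
# on its (illustrative) scale.
baseline = impact_composite(other_components=3.0, school_va=3.0)
swung = impact_composite(other_components=3.0, school_va=4.0)
print(f"Change in composite: {swung - baseline:.3f}")  # 0.050
```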
The issue is one of data integrity: “Minimal” is still “present.”
I would not eat a sandwich with “minimal” botulism detected.
Dee and Wyckoff’s tests regarding “gaming the system” on observations were focused on whether the evaluators showed favor. What was not tested was the degree to which teachers and students “staged a show” for the evaluators.
Even unannounced evaluations can be staged by teachers and students. I do not know that this has been done in DC, but I do know that the pressure is so high that staging would be a temptation.
As a public school teacher, several years ago I was pressured to stage a show in a high-stakes situation. Since the evaluator would not have been included in the plan, there would be no trend of evaluator bias.
For the record, I chose not to stage any show.
deutsch29: with all my experience as a TA in a fairly wide variety of classrooms, I cannot recall a single instance when an “unannounced evaluation” of a classroom was unknown or unexpected to those being evaluated.
If it was an outside team or individual, there were always warnings provided by “my cousin who works in central office” or the like, and if in-school, a friendly school clerk or someone else privy to the ‘secret’ gave advance warning.
Whether the teachers I worked with “staged” a show or not was entirely up to them. But they were never blindsided by the ‘surprise inspection.’
Another instance where very well educated people, with very narrowly defined expertise, are severely hampered (sometimes crippled) by having little or no on-the-job experience.
Thank you for keeping it real.
Not rheeal.
🙂
Thanks, KrazyTA. Any district that employs TFA recruits ought to be banned from high-stakes eval of its career teachers, VAM or no VAM.
I read the study carefully and was struck by the following statement, which is borne out by the data in Table 3: “Interestingly, the individual value-added (IVA) scores received by teachers were also similar across the ME [marginally effective] and HE [highly effective] analytical samples.” In other words, the marginally effective teachers and the highly effective teachers had the same value-added test scores, on average. The differences between them were due to different subjective evaluations by administrators. That seems like a startling indictment of the concept of value-added teacher evaluation yet it is buried in the paper.
I also notice that while the descriptive stats show that there was an overall increase in Impact Score for all teachers from one year to the next, which the authors describe as “suggestive that IMPACT may have had some of its intended effects,” the authors do not provide descriptive data showing whether the value-added scores increased for the sample as a whole.
The authors do provide regression discontinuity evidence that purports to show that the subset of teachers who stood the greatest risk of being fired (marginally effective) significantly improved their value-added scores. Strangely, however, the subset of teachers who were on the cusp of receiving large bonuses did not significantly improve their value-added scores. Furthermore, only one of the two cohorts of marginally effective teachers showed statistically significant improvement in their value-added scores, and the one cohort that did showed strikingly high numbers. This strikes me as suspicious.
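For readers unfamiliar with the method: a regression discontinuity design compares teachers who scored just below a rating cutoff with those just above it, on the logic that which side of the line a teacher lands on is as good as random. A minimal sketch of that kind of local linear estimate, on entirely made-up data (none of the variable names or numbers come from the paper):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: prior-year scores centered at the ME/E cutoff, and a
# next-year value-added outcome with a jump of 0.3 built in at the threshold.
score = rng.uniform(-50, 50, 2000)      # running variable, centered at cutoff
treated = (score < 0).astype(float)     # below cutoff: faces the dismissal threat
outcome = 0.01 * score + 0.3 * treated + rng.normal(0, 1, 2000)

# Local linear regression within a bandwidth of the cutoff, allowing the
# slope to differ on each side (a standard sharp-RD specification).
bandwidth = 20
near = np.abs(score) < bandwidth
X = sm.add_constant(np.column_stack([
    treated[near],                 # the discontinuity: the estimate of interest
    score[near],                   # slope of the running variable
    treated[near] * score[near],   # lets the slope differ below the cutoff
]))
fit = sm.OLS(outcome[near], X).fit()
print(f"Estimated jump at the cutoff: {fit.params[1]:.2f}")  # ~0.3 by construction
```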
All in all, the study seems to show that strong carrots and sticks will prompt teachers to receive higher evaluations from administrators. It does not seem to provide much evidence that the Impact Program improved student test scores, or even that it improved teacher performance as measured by student test scores.
Given these results, it would be ironic if this study were used to support the efficacy of value-added teacher evaluation.
aronf:
I puzzled over the small gap in IVA scores as well. However, it may not be that surprising, since both the ME/E sample and the E/HE sample are weighted towards the E mean – if I understand the sampling correctly.
The difference in level of effects between year 1 and year 2 may well reflect the realization that this system does have real consequences. This may not have been so obvious in Year 1 because the actual consequences were not visible until the summer between year 1 and 2.
bernie1815:
I don’t think you are understanding the sampling correctly. The ME sample consists of all teachers in ME and E, and the HE sample consists of all teachers in HE and E (pp 12-13). The mean Impact Scores of the two groups are different, as is directly shown in Table 3 and noted in the text (p 13). Yet their value-added scores are statistically the same; in fact, the ME group has slightly better value-added scores in the treatment year than the HE group.
Your hypothesis as to the difference between year 1 and 2 seems very speculative. It also doesn’t account for the fact that the HE group, which showed no significant improvement in value-added scores, showed highly significant improvements in the other metrics. This suggests that they were indeed on notice that the system had consequences.
aronf:
I think they need a separate table that more explicitly details their samples for both sets of analysis, i.e., ME/E and E/HE. Pages 11 through 13 are not clear. The IVA scores are for a relatively small sub-sample.
On page 22, the authors suggest/speculate: “This evidence implies that, in IMPACT’s second year (i.e., when the policy was more clearly credible), the dismissal threat implied by an ME rating reduced teacher retention dramatically.” They certainly have not proven the point, but it is a reasonable interpretation.
I agree more information would be helpful. In particular I’d like to see more complete descriptive statistics of the entire population broken down by Groups 1 and 2 and individual components of Impact scores.
However, the data that is presented is sufficient to call into serious question whether the value-added scores were meaningful measures of teacher effectiveness in this study. I think pages 11-13 and Table 3 are reasonably clear that the ME and HE samples include ME/E and E/HE, respectively. In any event, Table 3 shows explicitly that the two groups have different Impact scores, particularly in year t but also in year t+1 (there appears to be significant regression to the mean). I cannot think of any plausible explanation why they have the same value-added scores, other than that value-added scores are uncorrelated or very poorly correlated with the other measures of teacher effectiveness.
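To illustrate that inference with a toy simulation (made-up data, not the paper’s): if IVA tracked the other measures, two groups that differ on the composite would have to differ on IVA as well, so a near-zero IVA gap between the samples points to a near-zero correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

def iva_gap(corr: float) -> float:
    """Mean IVA gap between high- and low-composite groups, given the
    correlation between IVA and the other (e.g., observational) measures."""
    other = rng.normal(0, 1, n)
    iva = corr * other + np.sqrt(1 - corr**2) * rng.normal(0, 1, n)
    # Simplification: treat the grouping as driven by the non-IVA measures,
    # as Table 3 suggests it largely was.
    composite = other
    high = composite > np.quantile(composite, 0.7)
    low = composite < np.quantile(composite, 0.3)
    return iva[high].mean() - iva[low].mean()

for r in (0.8, 0.3, 0.0):
    print(f"correlation {r:.1f} -> IVA gap {iva_gap(r):.2f}")
```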
aronf:
Certainly more information on the composite IMPACT score is in order. The gap between ME and HE needs to be reported.
I don’t quite understand how there can be any debate. Any scientific study which contained fabricated data would be tossed out the window. How can we continue to tout research based on tainted data? All I can conclude is that those who have a vested interest in the success of IMPACT figured that perhaps enough time had passed since the cheating scandal, so people would forget that a full investigation was never performed despite ample evidence of “irregularities.”
2old, thank you. The issue of unresolved cheating makes this data tainted. No result that it produces can be trusted. Besides, Diane’s post following this one is an interview with Adell Cothorne, who has firsthand knowledge of the farce that is IMPACT.
There are so many issues with DC, and Rhee, and cheating, and IMPACT, that I could not subject a teacher to a DC-birthed eval in good conscience. Far from it.
The publicizing of the Dee and Wyckoff study as “evidence” that IMPACT is purging DC of “bad” teachers is callous at best and corrupt at worst.
I think one issue here is that folks are reading more into the paper than is there. The paper finds that if teachers are compensated and terminated based on scores according to a metric, teachers will adjust their classroom practices according to the metric. That is all the paper shows.
Of course, the cheating scandal basically shows the same thing: if teacher compensation and termination are based on a particular metric, some teachers and/or administrators will abandon their professional ethics and cheat. My students sometimes do the same thing when they are faced with a high-stakes exam.
2old2tch:
All empirical data contains contamination of some kind, especially field-based data. The authors need to demonstrate why the contaminated data is of limited import. Since the findings hold for Group 2, and the school-based VA is only 5% of the IMPACT score for Group 2, this part of the analysis is reasonable. Your and Mercedes’s concerns are why such studies need to provide other researchers with the opportunity to verify their results by providing access to their data set. It is a standard expectation for publishing research in economics: you cannot hide your data, your methods, or your code. But the authors have a right to wait until their article is published before sharing their data.
Sources of error and downright deception are two different animals. We have no way of judging the level of “contamination” because of the lack of a thorough investigation into the cheating. If a kid is caught lying on one occasion, we might cut him some slack in the future. If there is credible evidence that he lied several times, then what he says in the future is going to have to meet a much higher standard. There is credible evidence that there was widespread cheating in the DC schools. Any data they present for analysis is going to have to meet a very high standard. Given that the financial sector of this country has not done such a bang-up job in the recent past, I would say that economists would do well to view industry statistics with a sceptical eye and hold them to very high standards. Now for the disclaimer. I am a novice at statistical analysis despite a couple of long-ago courses. I do, however, have a well-developed sense of right and wrong.
2old2tch:
Have you read the study? The IVA data is available for 20% of the sample. The “cheating” impacted some unknown % of this 20%. The authors argue that their results hold if that 20% of the sample is dropped. I see no evidence to support the notion that all the IMPACT data is contaminated and that, therefore, this study has no merit. Would you honestly be saying the same things if the authors concluded that the IMPACT process led to no change in retention for the ME group and no increased ratings at the margins for either the ME or E/HE group?
” Would you honestly be saying the same things if the authors concluded that the IMPACT process led to no change in retention for the ME group and no increased ratings at the margins for either the ME or E/HE group?”
Yes.