A recent post noted a story in the New York Times that described a design flaw in the Texas tests created by Pearson (at $100 million per year). It reported that the state tests did not reflect the improvements observed in connection with an outside intervention, because the tests are designed to show improvement only in relation to previous and future versions of the test.
I am not a statistician or a psychometrician and do not feel competent to say whether this is a Eureka! moment. I leave that to others more competent than I. My reaction is that this finding bears further investigation. If it holds, the only way to improve scores on the tests is to prepare for the tests themselves; learning the subject in any other way would not register.
Someone commented negatively in response to this post and questioned the claims and “provenance” of the study.
The central figure in the news story, Professor Walter Stroup of the University of Texas, responds:
It’s hard to know what to make of someone who would find the provenance of a PhD thesis “suspicious” because, in its standard use, the word provenance simply refers to “the chronology of the ownership or location of a historical object.” Anyone who has read a thesis would have to know that, in conformity with long-established practices, such issues are typically addressed in the first few pages. Given this, one can only assume that “provenance” and “suspicion” are invoked in proximity to one another in the previous post for reasons having more to do with an effort to discredit the particular work being discussed. The implication is that somehow the artifact, in this case a PhD thesis by my former advisee Vinh Pham, is not what it purports to be, and thus is worth less than it might be if its provenance were secure.
While one might admire the elegance and subtlety of this form of malice and character assassination directed at both myself and, more importantly, my former student, I would suggest that in general: (1) PhD theses, especially those that emerge from top-rated graduate programs, are routinely cited in nearly any realm of formal inquiry as credible sources of scholarship, and (2) the best way to evaluate the quality and significance of that scholarship is to actually read it.
You might also note the NYT article does in fact refer to myself “and two other researchers” in the second of two introductory paragraphs. Both names, Drs. Vinh Pham and Guadalupe Carmona, were given to the reporter, Morgan Smith. My guess, and I should stress it is only a guess, is that she left them out only for reasons having to do with style. Having now addressed your concerns about provenance, I would close by simply expressing our sincere hope that you might now settle into actually reading the work you seem so committed to disparaging. A place to start might be Dr. Pham’s thesis: http://generative.edb.utexas.edu/presentations/2009PMENA/pham/VinhHuyPham09Dissertation.pdf
“I am not a statistician or a psychometrician” Consider yourself lucky, especially on the second one! Psychometrics is a direct descendant of the eugenics movement of the early 20th century.
From the linked article: “Standardized test scores the previous year were better predictors of their scores the next year than the benchmark test they had taken a few months earlier.”
Let’s look at that a little: “were better predictors.” Okay, what correlation coefficient is high enough to make that statement: .3, .4, or what? A .4 correlation coefficient means that the variable studied has a 1/6 chance of causing the outcome, certainly not very “strong”; a guaranteed losing bet 5/6 of the time. And a .4 is usually considered quite good in educational studies. Standardized test results have never been “good” predictors of anything other than maybe correctly guessing SES status, and even that doesn’t fare well.
Nope, it’s all “vain and illusory,” to quote Noel Wilson, the testing expert whom very few have read or heard of, the expert whose writings on the invalidities of educational standards, standardized testing, and the grading of students have never been refuted or rebutted.
See for yourself in “Educational Standards and the Problem of Error” found at:
http://epaa.asu.edu/ojs/article/view/577 or
“A Little Less than Valid: An Essay Review” found at:
http://www.edrev.info/essays/v10n5index.html
Duane, a 0.4 correlation does not have a 1/6 chance of causing the outcome. Correlation cannot be used to infer causation. The losing bet 5/6 times is also incorrect.
There’s a thread on the original post inquiring about whether Stroup’s study has been peer-reviewed. As his work runs counter to a large body of peer-reviewed evidence, it’s reasonable to expect that it be peer-reviewed and published before assigning it much weight.
If the study is peer-reviewed and published, it should not be dismissed. It’s still fine to debate it and academics should try to reproduce the findings.
I hope Stroup replies to the questions others are asking about peer-review.
As Ed notes, correlation has nothing to do with causation. This statement is so clichéd that there are memes about it. It also has nothing to do with odds. Nor is it used to “predict” SES or other non-malleable factors; typically those go on the other side of the equation. Duane also seems to be completely ignoring Chetty et al.’s work. Here’s a link.
http://www.nber.org/papers/w17699
Ed and Marshall,
Thanks for the responses. I fully understand that correlation is not causation. What I am trying to get at is this: if variable A is correlated with outcome B with a correlation coefficient of .4 (which I think most in education research consider a “good”/“strong” coefficient), then squaring the correlation coefficient gives us the “percent above chance,” a probability of A causing B of .16, or 16%. The flip side is that the probability of A not causing B is then 84% (roughly 5/6, or 83.3%).
I am not saying/inferring that A causes B, only that there is roughly a 1/6 chance that it MAY cause B. From my understanding of statistics (albeit limited, from only a couple of courses quite a while ago), one squares a correlation coefficient to get a “probability” of causation, which is completely different from saying A will cause B one out of six times.
Is not a correlation coefficient of 1 the definition of causation? If A happens and B results every time, it is considered causation, and that relationship has a correlation coefficient of 1.
I’m looking at this from a non-statistician’s viewpoint, based on what I thought was fairly common knowledge about a linear relationship between two variables A and B. Where am I wrong in my thinking?
I would like to follow up with a comment on the paper that Marshall linked to. Even though I would need to be a member to read the whole paper, I will go with the summary as linked.
Duane,
The short answer is no; even a correlation coefficient of 1 does not tell you anything about causation. One of my favorite examples, which you should like, is that SAT scores are highly correlated with the number of bathrooms in the student’s home. Do you think there is any chance that the number of bathrooms causes high SAT scores?
Marshall,
Your example of the number of bathrooms being highly correlated with SAT scores does help drive home the difference between correlated and caused. That helps to a degree.
But highly correlated is not a correlation of 1. Do you happen to know what the “real” correlation is? (For curiosity’s sake only on my part.) And even if it were 1, one would not say the number of bathrooms “caused” the SAT score. So, yes, I understand the point.
But then what is the use of a correlation coefficient to describe some sort of “connection” between data sets? If all it gives us is a mathematical/statistical statement, how does that coincide with the “real” world of human interactions? Why even bother? What does it “prove”?
I don’t know the number offhand. It is statistical folklore used to illustrate omitted-variable bias. Income is correlated with both higher SAT scores and more bathrooms, possibly in an explanatory way, so higher SAT scores are correlated with more bathrooms even though there is no causal relationship between the two.
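To make the folklore concrete, here is a minimal simulation sketch in Python (every coefficient and noise level invented purely for illustration): income drives both SAT scores and bathroom counts, the two outcomes never influence each other, and yet they come out correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.normal(0, 1, 50_000)                    # the omitted variable
sat = 0.7 * income + rng.normal(0, 1, 50_000)        # income lifts SAT scores
bathrooms = 0.7 * income + rng.normal(0, 1, 50_000)  # income buys bathrooms

# SAT and bathrooms never touch each other in the data-generating process,
# yet they come out clearly correlated through the shared cause:
print(round(np.corrcoef(sat, bathrooms)[0, 1], 2))   # roughly 0.33
```

The correlation is perfectly real, but it reflects the shared cause, not any link between the two outcomes.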
Correlation does not prove anything, but it can provide support for a causal explanation. If A causes B, you would expect A to be correlated with B. What you cannot do is say that if A is correlated with B, then A causes B, or B causes A.
Correlation can also be informative even when it is not related to causation. My university is concerned with the very low persistence rate among first-year students. It has very low standards for admission, which it has not changed for students who apply early in the application process, but it has announced that students applying late will have to be more academically qualified. I suspect the university has noticed a high correlation between late application and low persistence. I don’t think late application causes lower persistence; rather, late application and low persistence are both caused by some unobserved characteristics of the applicant, like lack of motivation or poor time-management skills. The university is using late application as a signal of these unobserved characteristics and will reject students who apply late in the process even though it would have accepted those very same students had they applied early.
Another first semester statistics anecdote in addition to Teaching Economist’s is the high degree of correlation between the severity of a fire and the number of firefighters. Using your logic, one would probably recommend not sending a lot of firefighters to very bad fires.
In regards to your question about the usefulness of correlation: in Stroup’s anecdote, the correlation just demonstrates that past student performance is a very good predictor of future performance, just as a prior year’s budget expenditures are a very good predictor of a future year’s. He interprets this as the assessments not being independent, with this non-independence leaving very little room for an intervention (or instruction) to affect a student’s measured learning on a standardized test.
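If you want to see the mechanism behind that anecdote, here is a minimal simulation sketch (hypothetical noise levels, not Stroup’s data): when two annual state tests both track a stable latent trait with little noise, while a mid-year benchmark tracks it with more noise, the prior-year test will out-predict the benchmark.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
trait = rng.normal(0, 1, n)                # a stable latent trait
year1 = trait + rng.normal(0, 0.5, n)      # last year's state test (low noise)
benchmark = trait + rng.normal(0, 1.5, n)  # mid-year benchmark (more noise)
year2 = trait + rng.normal(0, 0.5, n)      # this year's state test

print(round(np.corrcoef(year1, year2)[0, 1], 2))      # about 0.8
print(round(np.corrcoef(benchmark, year2)[0, 1], 2))  # about 0.5
```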
The problem with that interpretation is that I’m not sure it’s necessarily true. The program he refers to in the Tribune article (and that he evaluated) appears to have elicited a rather large, detectable impact on student learning (in another evaluation, this is equated to .5 standard deviations per year, which is large). Then, I guess, the question is what detectable effect is “large” enough to be satisfactory after accounting for prior performance. From the press release:
http://education.ti.com/educationportal/sites/US/nonProductSingle/about_press_release_news90.html
And, from the evaluation by Stroup and a co-author:
“The most striking change noted in Table 1 is the increase NCE mean score of the study students’ math TAKS scores from 35.34 in 2005 to 47.72 in 2006. This is particularly noteworthy due to the fact that all three comparison groups had NCE scores decreased from 2005 to 2006. The NCE scores for 2005 and 2006 are illustrated in the bar graph in Figure 1 for each of the four groups of students that were compared across analyses in this first round.”
There’s not a ton new there.
1) Many, including me, have been asking about the violation of unidimensionality in tests for quite a while.
2) Many, including me, have been asking about instructional sensitivity in tests for quite a while.
So, what IS new? Well, he modelled it. He showed that there’s a good explanation (i.e., one that fits the data) for the distressing pattern we see in the scores. It is that there is some latent trait in test takers (e.g., innate ability) that the tests are (accidentally) designed to measure. Because they test so many things (i.e., on the TAKS, 65 different math skills), the standard/best procedures of the field produce tests that end up measuring that latent trait instead of the 65 indicators.
How?
First, they look for items (i.e., questions) that seem to be well aligned with the others. That is, there is a high correlation in who gets the questions right. A question that people who get the others right are unlikely to answer correctly is tossed out as being weird: the stats for that item are weird, and it is not included. This process tends to remove the items that are instructionally sensitive (i.e., for which good instruction can help test takers learn). Instruction varies across subtopics, and so performance should vary across subtopics, he says.
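Here is a toy sketch of that screening step (simulated data and a made-up threshold, not any vendor’s actual procedure): nine items are driven by a general ability, one is driven by a separately taught skill, and the taught item is the one flagged by its low item-total correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_students, n_items = 500, 10
ability = rng.normal(0, 1, n_students)   # general latent ability
taught = rng.normal(0, 1, n_students)    # a separately taught skill

# Items 0-8 are driven by general ability; item 9 by the taught skill.
logits = np.column_stack([ability] * 9 + [taught])
responses = (rng.random((n_students, n_items)) <
             1 / (1 + np.exp(-logits))).astype(float)

total = responses.sum(axis=1)
for j in range(n_items):
    rest = total - responses[:, j]                # item-rest score
    r = np.corrcoef(responses[:, j], rest)[0, 1]  # item-total correlation
    flag = "   <- looks 'weird', gets dropped" if r < 0.15 else ""
    print(f"item {j}: r = {r:.2f}{flag}")
```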
Second, IRT (the method used to take performance on individual items and combine it into an overall score, a method FAR more advanced than counting them up and computing a percentage) tends to make this even worse. It looks for what the items have in common, but a math test measures a whole bunch of different math skills. By looking for what performance has in common, it ignores the differences in the skills. Well, what is left? Some sort of internal trait in the test taker that matters for all the items. What if that is simply latent/innate math ability, the kind that just can’t be taught?
That would lead to consistent reports of students’ innate ability from year to year. Variations in that ranking are just due to random fluctuation (e.g., someone having a bad day, items that are not as “good,” the wrong kind of pencil, etc.).
Well, he tested that theory. And his model seemed to fit the real data in Texas pretty well. How well? Is it good enough to be sure? No, not good enough to be sure. But good enough to be asking some serious questions.
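To make the unidimensionality point concrete, here is a minimal Rasch (1PL) sketch, simulated data only, not Stroup’s model or Pearson’s actual procedure: every item, whatever skill it nominally covers, gets scored through a single latent trait per student.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n_students, n_items = 200, 20
theta_true = rng.normal(0, 1, n_students)   # one latent trait per student
b_true = rng.normal(0, 1, n_items)          # one difficulty per item
responses = (rng.random((n_students, n_items)) <
             expit(theta_true[:, None] - b_true[None, :])).astype(float)

def neg_loglik_and_grad(params):
    theta, b = params[:n_students], params[n_students:]
    logits = theta[:, None] - b[None, :]
    nll = -(responses * logits - np.logaddexp(0, logits)).sum()
    resid = responses - expit(logits)       # y - p
    grad = np.concatenate([-resid.sum(axis=1), resid.sum(axis=0)])
    return nll, grad

fit = minimize(neg_loglik_and_grad, np.zeros(n_students + n_items),
               jac=True, method="L-BFGS-B")
theta_hat = fit.x[:n_students]
# The single fitted trait tracks the true one (location is identified only
# up to a constant shift, which correlation ignores):
print(round(np.corrcoef(theta_true, theta_hat)[0, 1], 2))
```

Notice that nothing in the model knows or cares which of the 20 items covers which skill; all of them are explained through one theta per student.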
Unfortunately, this is probably not going to be discussed properly anywhere. Policy makers can’t understand this very technical stuff. Many in the industry depend for their livelihoods on these kinds of things working, so they’re not going to honestly try to undermine their own jobs. And the people who DO understand and think this way haven’t been listened to anyway; they have already been warning against value-added assessment of teachers.
So, where is this conversation taking place? This dissertation is not really making new accusations, and the NYT article fails to make that clear. It’s not that IRT is bad; it’s just that it might not be suitable for summative testing of an entire content domain. The instructional-sensitivity problem? I don’t know where that is being seriously discussed.
The next generation of putting scores together, the one that comes after IRT, is better suited to this. IRT is better than classical test theory, and this new approach, Bayes networks, is really promising. I don’t know a ton about it, but it really deals with this idea of the multidimensionality of content. For now, however, we have IRT. Heck, some major tests are just getting to IRT now. It will be 5 or 10 years before Bayes nets are the basis for the big tests.
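For a flavor of what a Bayes-net approach buys you, here is a toy sketch (the skills, items, and every probability are invented for illustration): instead of one overall score, you get a separate posterior for each skill.

```python
from itertools import product

# Skill priors and per-item response probabilities (all numbers invented).
p_skill = {"algebra": 0.6, "geometry": 0.5}
items = {  # item -> (skill it taps, P(correct | has skill), P(correct | lacks it))
    "q1": ("algebra", 0.90, 0.20),
    "q2": ("algebra", 0.85, 0.25),
    "q3": ("geometry", 0.90, 0.20),
}
observed = {"q1": 1, "q2": 1, "q3": 0}   # one student's right/wrong pattern

# Enumerate the four joint skill states and weight each by the evidence.
weights = {}
for alg, geo in product([0, 1], repeat=2):
    has = {"algebra": alg, "geometry": geo}
    w = (p_skill["algebra"] if alg else 1 - p_skill["algebra"]) * \
        (p_skill["geometry"] if geo else 1 - p_skill["geometry"])
    for q, (skill, p_know, p_guess) in items.items():
        p_correct = p_know if has[skill] else p_guess
        w *= p_correct if observed[q] else 1 - p_correct
    weights[(alg, geo)] = w

z = sum(weights.values())
p_alg = sum(w for (a, _), w in weights.items() if a) / z
p_geo = sum(w for (_, g), w in weights.items() if g) / z
print(f"P(algebra | responses) = {p_alg:.2f}")
print(f"P(geometry | responses) = {p_geo:.2f}")
```

The point of the design: a student who nails the algebra items but misses the geometry one ends up with a high algebra posterior and a low geometry posterior, rather than one blended number.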
ceolaf,
Have you read the Wilson works that I referenced above your comment? If not, it would behoove you to do so. You’d probably then realize that CTT, IRT, and Bayes networks are all “vain and illusory.”
I am now reading Vinh Huy Pham’s dissertation, and no, it’s not an easy read, but I see the same logical and rational problems described by Wilson cropping up in Pham’s analysis.
The problem is with psychometrics per se, and it is intractable, as shown by Wilson. The basis of that intractable problem is the logical inconsistency of attempting to quantify (via educational standards and standardized testing) a quality (the process of teaching and learning). Can’t be logically done! Where’s Spock when you need him!!
It’s a shame you chose to characterize my comment as “negative,” and to quote only Stroup’s response in full, as opposed to giving my comment and response more attention. These are all reasonable questions, and they are only negative in the sense that scientists are trained to be suspicious and critical of everything. It’s sort of how the whole enterprise works. I do not think it is unfair (or “negative”) to ask for more details about the study, whether it has been subjected to peer review, whether there is a journal-length article of the findings, and why the work’s primary author is not the one disseminating it, rather than a person who did not write it. This is very bizarre.
Asking these questions about a study which purports to demonstrate that standardized tests are fundamentally flawed is completely reasonable given the gravity of these accusations.
Marshall and Ed,
I was going to reply to the study Marshall linked, but I have changed my thoughts regarding it because, as I mentioned above, it rests on the false premise that standardized tests are capable of measuring “something.” Start with a false premise and you usually end with a false conclusion.
Again, I’ll ask both of you: have you read Wilson’s work???
If so, please help me identify any logical or rational mistakes or assumptions he makes. Rebut what he has to say; otherwise we are talking past each other, as I’m saying (in concurrence with Wilson) that the whole concept of standards and standardized testing is completely invalid, and therefore any conclusions drawn are invalid.
Not to mention that using a standardized test for something other than its intended use is unethical. How do you square that with the concept of VAM??
Help me out. You’re wanting me to use a process that is logically and rationally invalid and unethical???
Sorry but this dog don’t hunt that way.
Duane-
In response to your earlier comment about squaring the correlation coefficient: you’re still not there. This has nothing to do with causation, or really anything related to this study. It’s also technically wrong. I think you’re conflating it with explained variance, but it’s so muddled it’s hard to tell.
I’d recommend you Google “causal inference” and just take a few days reading. Don’t skip over anything by Gary King. I’ll buy you a beer if you see any mention of “squaring a correlation coefficient to determine causation.”
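In the meantime, here’s a quick numeric sketch (simulated data, made-up numbers) of what squaring a correlation coefficient actually gives you: the share of B’s variance that a linear fit on A accounts for, which is not a probability that A causes B.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(0, 1, 10_000)
B = 0.4 * A + rng.normal(0, 1, 10_000)     # B depends linearly on A, plus noise

r = np.corrcoef(A, B)[0, 1]
slope, intercept = np.polyfit(A, B, 1)
residuals = B - (slope * A + intercept)
explained = 1 - residuals.var() / B.var()  # R^2 of the linear fit

print(round(r ** 2, 3), round(explained, 3))  # the two numbers match
```

Here we built B from A by construction, and r squared still just reports explained variance; it would come out identical if some third variable had produced both.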
In regards to your most recent comment, which seems to be some sort of postmodern critique of any measurement or quantification in education, I’ve got nothing for you there. If you truly believe that an instrument can’t detect anything (and that’s all a test is), then you’re beyond being saved by science. If you want to argue that tests measure aptitude with error, then no social scientist would argue with you. From there, you can have a reasonable discussion about designing tests that minimize this error, and about statistical models that acknowledge and attempt to model it.
Marshall,
Again, thanks for the response. I just try to learn a little bit more every day, and your response helps!
“I’ll buy you a beer if you see any mention of “squaring a correlation coefficient to determine causation.”
Now you’re talking my lingo (about buying a beer)!! But I did not say “squaring a correlation coefficient to determine causation” at all. I just said that it can be perceived as determining a probability of causation; two different beasts.
“In regards to your most recent comment, which seems to be some sort of postmodern critique of any measurement or quantification in education, I’ve got nothing for you there. If you truly believe that an instrument can’t detect anything (that’s all a test is), then you’re beyond being saved by science.”
Is a “postmodern critique” not valid on its face? You appear to say so; if so, why? Critique the critique. (Oh, shit, now we’re getting into post-postmodern critique, ha ha!)
Yes, I am saying that it is impossible to accurately quantify student learning, and by extension teacher effectiveness. It’s a chimera, a falsehood; it’s invalid. Show me how you think it isn’t and I’ll show you the myriad ways it is.
“. . . then you’re beyond being saved by science.” No, it’s not about being saved by science (“I ain’t saved by nuthin’”); what you mean is not being saved by the pseudoscience that is psychometrics and all its machinations. No, my position is the logical, rational, scientific position, not yours!!
Duane
Duane-
When teachers grade work done by students, what is that grade based on? You say it is in principle impossible for it to be based on student learning, yet a wide range of grades is given to students in the same high school.
I’m a reporter for the Dallas Morning News working on a piece about Stroup’s work. I’m looking for people who can help me put it into context. Some of y’all who have posted here appear to have the credentials for that, but you haven’t exactly made it easy to get in touch with you… 1:-{)>
jweiss@dallasnews.com