Validity in Classroom Assessment

Hi, this is Dan Hickey. This is one in a series of short videos about the practices, principles, and policies of educational assessment. In this video, I am going to talk about one of the most important principles of all: the principle of validity.

Validity is a topic that's very near and dear to me. I've focused on it for much of my career, I've written several papers about it, and I have a lot of opinions about it. In this video, I am going to cover the basics of validity as they relate to instructors, administrators, and developing researchers, and I'll throw in opinions of my own along the way.

Validity differs from the reliability you're probably familiar with. We can talk about a test being reliable, but strictly speaking a test or assessment can't be valid. When people refer to an assessment as being valid, that is actually erroneous. Validity is a property of the evidence. Results from assessments or tests become evidence when we attempt to use those scores to support some sort of claim or inference. Now, "claim" is kind of a strong word for classroom assessment. You're probably much more concerned with what are best called inferences: inferences about what a student knows. When you make inferences, you want those inferences to be valid. In my experience,
validity is a pretty abstract concept. It's difficult and confusing for teachers and administrators, but I'm often surprised at how many researchers (seasoned researchers) don't really have a strong handle on validity, which matters because among educational researchers assessment results underpin a lot of research conclusions.

This video should help you see why clear curricular aims are so important to good classroom assessment. Without a clear curricular aim, you end up focusing on teaching and instruction, but that's not what assessment is about. Validity in classroom assessment is about what somebody knows or what somebody has learned. As you will see, teaching and instruction matter, but you have to start with knowing and learning.

Another way to discuss validity is in terms of defensible inferences. First, we instruct students
on a particular curricular aim, and then we give them some sort of assessment to see if they met that aim. We use the scores to make inferences about students' mastery of the aim. We want those inferences to be defensible, backed up by solid evidence from the assessment scores, and we also want to make sure that both the test and the inferences we're making refer to the curricular aim we hope to measure. If we do all this, then most likely we can claim that our inferences are valid.

In the next few slides, I'll discuss three types of validity evidence that might be gathered in typical
assessment settings. Content-related validity evidence is best understood in terms of your curricular aim: does the assessment evidence target the curricular aim? Criterion-related evidence is a bit different because it's about whether a particular assessment procedure will allow you to predict how well a student will do on a criterion variable. Construct-related evidence concerns some underlying psychological construct. While things like achievement can be thought of as constructs, you'll find that construct validity lends itself better to things like motivation and self-efficacy. I'll talk about each of these in more detail in later slides.

First, I want to give you a general example that illustrates the difference between reliability and validity by thinking about a
bull's eye. If you think of your curricular aim, the criterion variable, or some psychological construct as the bull's eye of a target, then you can think about evidence in terms of that target.

On the upper left, the evidence (those points) is scattered pretty broadly, and it's not scattered around the bull's eye; it's clustered around a spot above it. That evidence is both unreliable and invalid. On the upper right, the data points are scattered pretty broadly, but around the center of the bull's eye. In this case the evidence is valid because it's clustered around the center, but it's not very reliable; we got kind of lucky, and some of those points could have gone astray in a different direction. On the lower left, the data points are tightly clustered around a point, but it's not the center. What you have there is evidence that is certainly reliable but not valid: it's not aligned with the curricular aim, not targeted on your criterion, or not targeting the construct. On the lower right, all of the points are clustered around the center of the bull's eye. What you have there is evidence that is both reliable and valid.

So, let's return to content-related validity evidence. This is the extent to which the assessment procedure adequately represents the content
assessment procedure adequately represents the content
of the curricular aim that it claims to target. Now the content can
refer to knowledge, skills, or attitude. But really, in most cases, this
is going to be knowledge about something. You might think of this in terms of particular concepts that students understand, facts that are remembered, or skills that can be demonstrated. There has been a big shift in recent years in the way that assessment specialists think about content related validity. People really used to think that…they took it for granted.. that you could break complex knowledge down into a bunch a tiny facts and skills, but
many of us now recognize that curricular aims and assessments need to include so called higher-order skills that really
can’t be meaningfully broken down into a little parts. I am to return to
this, but the point I want to make, though, here is that if you have these
items on your assessment that really get at these higher level things and they are
not broken down those really need to be on that bull;s-eye,
so to speak. When it comes to ensuring content
related validity in your assessments, there are several
ways to think about it. The most important one is known as simply
developmental care, and this is the one that I want to encourage you to think
about and practice the most. That is, when you develop each test item
and, most importantly when you review your completed assessment you want to just really think hard: is
this getting at the content of the curricular aim.
Now, as the stakes get higher, you want to start thinking about more sophisticated methods. External review is one. Similar to the way you search for bias in assessments, you put together a panel of reviewers. These can be colleagues; they just need to be people who understand your curricular aim and can give you second opinions on whether each item is really aligned with it. More generally, there is the notion of alignment, and when you get into high-stakes tests this becomes really important. Here's an overview of four criteria that Norman Webb has outlined. The first is called categorical concurrence, and it asks: are the same or consistent categories being used in both the curricular aim and the assessment? The second has to do with depth of knowledge, which gets at what I mentioned earlier: to what extent are the cognitive demands of the curricular aim and the assessment the same? One of the problems with validity is that teachers, in order to get a reliable and efficient assessment, often don't go nearly as deep into the curricular aim as the curriculum does. In other words, they'll have shallow, recognition-level multiple-choice items for a relatively sophisticated curricular aim involving problem solving, and you end up with evidence that simply isn't valid for whether you've met that curricular aim. The third evaluative criterion is range-of-knowledge correspondence, which has to do with the span of knowledge reflected in the curricular aim. This gets back to the sort of spread we talked about earlier, and whether you're really covering the whole scope of the curricular aim. Finally, there's balance of representation. This is the point I raised earlier: are different curricular aims given appropriate emphasis on a given assessment? If you've got a really central problem-solving task that's worth a lot of points and is going to take students a lot of time to complete, that should really be near the center.

Let's shift gears now and talk about criterion-related validity evidence. This is the degree to which performance on an
assessment accurately predicts a student's performance on some external criterion. For example, think of the SAT. It's a very common college entrance exam, and it claims to predict how well students will do in college. There's some evidence for this (not as strong as people think), but when you use the criterion variable of freshman GPA and, to a lesser extent, graduation rates, you see that the SAT does provide valid evidence of whether or not students meet that criterion. Many schools also use placement tests, particularly in mathematics. In my experience, some of these assessments can be remarkably invalid, in that students who do well on them don't do well in the class they get placed into, while other students are screened out of classes they belong in because they didn't do well. This is often because the format of the assessment is not aligned to the way the students learned the material earlier, or it's just poorly designed and unreliable.

The third type of validity evidence is construct-related evidence. In technical terms, construct-related evidence of validity is the extent to which empirical evidence confirms that an inferred construct exists and that a given assessment procedure is measuring that construct accurately. This is where it starts to get a little complicated: to have construct-related validity there has to be a construct in the first place, which is what I mean when I say we are really getting into the realm of psychological measurement here. If you want to gather construct-related evidence of validity, you first have to make a hypothesis about students' performance based on the construct you're measuring. Then you have to gather empirical evidence to either confirm or reject that
hypothesis.

Let's talk about the ways test developers gather construct-related evidence of validity. The first is an intervention study. Two or more groups are given the same initial assessment; then some or all of the students receive some sort of intervention and are given a post-assessment to see if the intervention had any effect on their scores. Sometimes the groups receive two or more different interventions so the difference between them can be examined. Sometimes this takes the form of a more traditional experimental study with a control group that does not receive an intervention. In many educational settings, however, this is problematic. It might be unethical: you're withholding treatment (here, education) from one group, and you don't know what to do with the students who were in the control group. So this is pretty unusual in developing educational assessments. A second way investigators might gather construct-related validity evidence is through differential population studies, which look at how students from different populations score differently on an assessment. This is typically done while an assessment is under construction. For example, when the Test of English as a Foreign Language was being developed, both English language learners and native English speakers were given the exam in order to confirm that the different populations performed differently. Finally, investigators can also perform
what's known as a related measures study. Here, investigators examine a new assessment as it relates to one for which evidence of validity has already been found. That evidence can be convergent evidence, meaning that scores on the two assessments, the old and the new, strongly relate to one another. Or it might be discriminant validity evidence, meaning that scores on the two assessments have a weak relation to one another.

Here are two additional validity-related concepts that I want to discuss; both are really important to me. The first is face validity. This refers to when the appearance of a test seems to correspond with what it claims to be measuring. The key word here is "appearance." To Jim Popham and other assessment professionals, appearance is not adequate evidence of validity; they refer to it as an unsanctioned form of validity evidence. In my experience, this is changing pretty quickly because of social media. Consider, for example, the digital badges that students earn in my assessment course. Students who get more promotions from their classmates for their work stand a chance to earn a leader badge. The evidence supporting the assertion that a student is actually a leader is contained in the assignments they can choose to include in that badge. But let's say there are some shenanigans going on: students start agreeing to trade promotions, and people get wind of it. All of a sudden, everybody in the class concludes that the leader badges don't appear to be valid. At that point, it really doesn't matter what I say or what the evidence says; if the crowdsourced conclusion is that the badge isn't valid, the evidence doesn't really matter. This is pretty new stuff, but I predict that
in the coming years, as assessment opens up and becomes more transparent, things like digital badges become more important, and information about assessments and scores starts circulating on the web, we're going to see that no matter how valid people say the information is, if people don't buy the value of the information, it loses its validity.

The second concept is consequential validity. Consequential validity became widely appreciated in the late eighties and early nineties with large-scale assessment reforms in the US and elsewhere. The rationale for introducing high-stakes assessment reform was the positive consequences that were presumed to follow from portfolio and performance assessment. It turns out that those positive consequences really didn't materialize, and many of the problems, especially reliability and cost, loomed large. My own career was really shaped by the notion of consequential validity. While I was a postdoc at Educational Testing Service, I began reading work by Lorrie Shepard, who argued that we should care about consequential validity, but that we shouldn't wait until after the assessments are administered. She gave a great example: the kindergarten readiness tests that were very popular in the seventies. Many students were held back one, two, or even three years from entering kindergarten because the test they were given concluded that they weren't ready. These tests were based on the developmental psychology of the time.
Well, the consequences were disastrous: a decade later, we ended up with large numbers of poor and minority students who were still freshmen in high school at seventeen or eighteen years of age.

In my work, I think a lot about the consequences of assessment. In particular, I worry that some of our assessment practices reduce complex concepts down to things that can be measured, and we lose sight of the more consequential and more critical aspects of that knowledge. Although I think somewhat differently about the knowledge we're assessing in our curricular aims, I am really concerned with the more social and more contextual knowledge that's difficult to put into assessments. I care a lot about concepts and skills and things we know how to measure, but I think of those as somewhat secondary. In my work, what I really try to do is make sure that my classroom assessments gather evidence that students are taking away the concepts and skills I want them to have, while trying not to focus on them in my instruction. And I try not to let the assessment have the consequence of driving too much discussion about what's going to be on the test, because of the very narrow way that assessments represent the important knowledge I care about.

Let's review the relationship between reliability and validity. To reiterate, reliability is a property of the assessment, whereas validity is a property of the claims we make with that information. Generally speaking, an assessment needs to be reliable to support valid claims. In my opinion, this gets a little overblown, particularly with classroom assessments. The things I care about are very difficult to measure reliably, so sometimes it makes sense to sacrifice a little reliability in order to get valid assessments, especially consequentially valid assessments of things we think are important.

Let's review by thinking about the relevance of the different kinds of validity evidence. For most educators, content-related validity evidence is going to be the most important. In my own work, I create a lot of assessments for use both in my own classes and in the classes where I do research. When I put on my researcher hat, I worry a lot about how well my assessments represent the content of a curricular aim; these are what I call proximal assessments, oriented to a particular curriculum that embodies that aim. Criterion-related evidence, on the other hand, is particularly important for administrators and test developers. While I often measure students' achievement of external standards, I am rarely concerned with whether or not students meet a particular criterion. Finally, construct-related validity evidence is particularly important for researchers. Sometimes educators, administrators, and doctoral students are going to try to measure things like self-efficacy, motivation, or goal orientation, and if that's the case, then construct-related validity evidence is indeed going to be particularly important to you. Earlier in my career, I spent a lot of time studying motivation and cared a lot about construct validity evidence, but I don't give it as much thought as I used to.

That's all for now on the topic of validity. I really want to encourage you to think deeply about validity in the context of your own assessment practice. It was probably a little hard to follow some of these points today, so I encourage you to go back over them, look up more information, read about validity in texts, and think hard about your own assessments. It's really the only way to make sense of these concepts. Thank you very much. Goodbye.
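Editor's footnote: the bull's-eye analogy from the video maps onto two simple statistics, spread (reliability) and bias (validity), and can be sketched numerically. The sketch below is illustrative only; the point values, the `describe` helper, and the cutoffs of 1.0 are invented for the example, and the two-dimensional target is collapsed to a single dimension for simplicity.

```python
import statistics

TARGET = 0.0  # the "bull's eye": the curricular aim, criterion, or construct

def describe(points):
    # Reliability ~ low spread: the points cluster tightly together.
    spread = statistics.stdev(points)
    # Validity ~ low bias: the cluster is centered on the target.
    bias = abs(statistics.mean(points) - TARGET)
    return ("reliable" if spread < 1.0 else "unreliable",
            "valid" if bias < 1.0 else "not valid")

# Four invented point clouds matching the four quadrants of the slide.
clouds = {
    "upper left":  [4.1, 6.0, 3.2, 7.3, 5.4],    # scattered AND off-target
    "upper right": [-3.0, 2.5, 0.4, -1.8, 2.2],  # scattered, but centered
    "lower left":  [5.0, 5.2, 4.9, 5.1, 5.0],    # tight, but off-target
    "lower right": [0.1, -0.2, 0.0, 0.2, -0.1],  # tight AND centered
}

for name, pts in clouds.items():
    print(name, describe(pts))
```

Note how the "upper right" cloud comes out valid despite no single point sitting exactly on the target: validity here is about where the evidence is centered, not about any individual score.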

About James Carlton


7 thoughts on "Validity in Classroom Assessment"

  1. Excellent explanations. This appears to follow Popham's Chapter 4 in his textbook on Classroom Assessment. I teach a course using that book, so this video is particularly relevant and helpful. Thank you.

  2. Thank you for the great video!
    For the upper right "unreliable, but valid" case, could you please explain why it is valid because the data points cluster AROUND the bull's eye? I thought it was valid since it has a few points RIGHT at the target.

  3. Thank you, sir! Very informative video. Kudos. Helpful for students taking an assessment-in-learning subject. May God bless you.
