Pearson’s chi square test (goodness of fit) | Probability and Statistics | Khan Academy

I’m thinking about buying a restaurant, so I go and ask the current owner, what is the distribution of the number of customers you get each day? And he says, oh, I’ve already figured that out. And he gives me this distribution over here, which essentially says 10% of his customers come in on Monday, 10% on Tuesday, 15% on Wednesday, and so forth and so on. They’re closed on Sunday. So this is 100% of the customers for a week; if you add that up, you get 100%. I obviously am a little bit suspicious, so I decide to see how well this distribution that he’s describing actually fits observed data. So I actually observe the number of customers when they come in during the week, and this is what I get from my observed data.
So to figure out whether I want to accept or reject his hypothesis right here, I’m going to do a little bit of a hypothesis test. So I’ll make the null hypothesis that the owner’s distribution, so that’s this thing right here, is correct. And then the alternative hypothesis is going to be that it is not correct, that it is not a correct distribution, that I should not feel reasonably OK relying on it, that I should reject the owner’s distribution. And I want to do this with a significance level of 5%. Or another way of thinking about it: I’m going to calculate a statistic based on this data right here, and it’s going to be a chi-square statistic. Or another way to view it is that the statistic I’m going to calculate has approximately a chi-square distribution. And given that it does have a chi-square distribution with a certain number of degrees of freedom, and we’re going to calculate that, what I want to see is whether the probability of getting this result, or a result more extreme, is less than 5%. If the probability of getting a result like this or something less likely than this is less than 5%, then I’m going to reject the null hypothesis, which is essentially just rejecting the owner’s distribution. If I don’t get that, if I say, hey, the probability of getting a chi-square statistic that is this extreme or more is greater than my alpha, my significance level, then I’m not going to reject it. I’m going to say, well, I have no reason to really assume that he’s lying. So let’s do that.
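Here is that setup in a quick Python sketch. The variable names are mine; the percentages and the observed counts are the ones read out in the video (the Thursday through Saturday percentages and the observed counts come up a little later in the walkthrough).

```python
# Owner's claimed distribution and the counts I actually observed (closed on Sunday).
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
claimed_pct = [0.10, 0.10, 0.15, 0.20, 0.30, 0.15]  # owner's distribution, sums to 1
observed = [30, 14, 34, 45, 57, 20]                 # customers observed in one week

alpha = 0.05  # significance level

# H0: the owner's distribution is correct.
# H1: it is not correct (reject the owner's distribution).
```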
So to calculate the chi-square statistic, what I’m going to do is, well, here we’re assuming the owner’s distribution is correct. So assuming the owner’s distribution was correct, what would have been the expected observed? We have the expected percentage here, but what would have been the expected observed? So let me write this right here; I’ll add another row, Expected. So we would have expected 10% of the total customers in that week to come in on Monday, 10% of the total customers of that week to come in on Tuesday, 15% to come in on Wednesday.

Now to figure out what the actual number is, we need to figure out the total number of customers. So let’s add up these numbers right here. I’ll get the calculator out: we have 30 plus 14 plus 34 plus 45 plus 57 plus 20. So there’s a total of 200 customers who came into the restaurant that week. So let me write this down: I had 200 customers come in for the week. So what was the expected number on Monday? Well, on Monday, we would have expected 10% of the 200 to come in, so this would have been 20 customers, 10% times 200. On Tuesday, another 10%, so we would have expected 20 customers. Wednesday, 15% of 200, that’s 30 customers. On Thursday, we would have expected 20% of 200 customers, so that would have been 40 customers. Then on Friday, 30%, that would have been 60 customers. And then on Saturday, 15% again; 15% of 200 would have been 30 customers. So if this distribution is correct, this is the actual number that I would have expected.
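Continuing the Python sketch from above, the expected counts are just the owner’s percentages applied to that weekly total:

```python
total = sum(observed)                        # 30 + 14 + 34 + 45 + 57 + 20 = 200
expected = [p * total for p in claimed_pct]  # [20.0, 20.0, 30.0, 40.0, 60.0, 30.0]
print(total, expected)
```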
Now to calculate the chi-square statistic, we essentially just take, well, let me just show it to you. Instead of writing chi, I’m going to write capital X squared. Sometimes someone will write the actual Greek letter chi here, but I’ll write the X squared here. And let me write it this way: this is our chi-square statistic, but I’m going to write it with a capital X instead of a chi because this is only going to have approximately a chi-square distribution. I can’t assume that it’s exactly chi-square, so this is where we’re dealing with approximations right here. But it’s fairly straightforward to calculate. For each of the days, we take the difference between the observed and the expected. So it’s going to be 30 minus 20, I’ll do the first one color coded, squared, divided by the expected. So we’re essentially taking the square of the difference between what we observed and what we expected, and we’re kind of normalizing it by the expected right over here. But we want to take the sum of all of these, so I’ll just do all of those in yellow: plus 14 minus 20 squared over 20, plus 34 minus 30 squared over 30, plus, I’ll continue over here, 45 minus 40 squared over 40, plus 57 minus 60 squared over 60, and then finally, plus 20 minus 30 squared over 30. I just took the observed minus the expected, squared, over the expected, and I took the sum of it, and this is what gives us our chi-square statistic.
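Written out as a formula, with $O_i$ the observed count and $E_i$ the expected count for each day, the statistic being described is:

$$X^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} = \frac{(30-20)^2}{20} + \frac{(14-20)^2}{20} + \frac{(34-30)^2}{30} + \frac{(45-40)^2}{40} + \frac{(57-60)^2}{60} + \frac{(20-30)^2}{30}$$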
Now let’s just calculate what this number is going to be. So this is going to be equal to, I’ll do it over here so you don’t run out of space, and we’ll do this in a new color, we’ll do it in orange. This is going to be equal to: 30 minus 20 is 10, squared, which is 100, divided by 20, which is 5. I might not be able to do all of them in my head like this. Actually, let me just write it this way just so you can see what I’m doing. This right here is 100 over 20. Plus 14 minus 20 is negative 6, squared is positive 36, so plus 36 over 20. Plus 34 minus 30 is 4, squared is 16, so plus 16 over 30. Plus 45 minus 40 is 5, squared is 25, so plus 25 over 40. Plus the difference here is 3, squared is 9, so it’s 9 over 60. Plus we have a difference of 10, squared, so plus 100 over 30. And this is equal to, I’ll just get the calculator out for this: 100 divided by 20, plus 36 divided by 20, plus 16 divided by 30, plus 25 divided by 40, plus 9 divided by 60, plus 100 divided by 30 gives us 11.44. So let me write that down. This right here is going to be 11.44. This is my chi-square statistic, or we could call it a big capital X squared. Sometimes you’ll have it written as a chi-square, but this statistic is going to have approximately a chi-square distribution.
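The same arithmetic, continuing the Python sketch from above:

```python
# Chi-square goodness-of-fit statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 11.44  (5 + 1.8 + 0.533... + 0.625 + 0.15 + 3.333...)
```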
Anyway, with that said, let’s figure out, if we assume that it has roughly a chi-square distribution, what the probability is of getting a result at least this extreme. Or, another way of saying it: is this result more extreme than the critical chi-square value, the value that has only a 5% chance of being exceeded? So let’s do it that way. Let’s figure out the critical chi-square value, and if this is more extreme than that, then we will reject our null hypothesis. So let’s figure out our critical chi-square value. We have an alpha of 5%. And actually, the other thing we have to figure out is the degrees of freedom. For the degrees of freedom, we’re taking one, two, three, four, five, six sums, so you might be tempted to say the degrees of freedom are six. But one thing to realize is that if you had all of this information over here, you could actually figure out this last piece of information, so you actually have five degrees of freedom. When you have n data points like this, and you’re measuring the observed versus the expected, your degrees of freedom are going to be n minus 1, because you could figure out that nth data point just based on everything else that you have, all of the other information. So our degrees of freedom here are going to be 5; it’s n minus 1. So our significance level is 5%, and our degrees of freedom are also going to be equal to 5. So let’s look at our chi-square distribution: we have a degree of freedom of 5, we have a significance level of 5%, and so the critical chi-square value is 11.07.
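If you would rather not read the critical value off a printed chi-square table, it can be looked up with SciPy’s inverse CDF (a sketch, continuing the variables from the code above):

```python
from scipy.stats import chi2

df = len(observed) - 1              # 6 categories minus 1 = 5 degrees of freedom
critical = chi2.ppf(1 - alpha, df)  # 95th percentile of the chi-square(5) distribution
print(round(critical, 2))           # 11.07
```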
So let’s go with this chart. We have a chi-square distribution with a degree of freedom of 5; that’s this distribution over here in magenta. And we care about a critical value of 11.07, so this is right here. Oh, you actually can’t even see it on this. So if I were to keep drawing this magenta line all the way over here, if the magenta line just kept going, over here you’d have 8, over here you’d have 10, over here you’d have 12. 11.07 is maybe some place right over there. So what it’s saying is that the probability of getting a result at least as extreme as 11.07 is 5%. So we could write it even here: our critical chi-square value is equal to, we just saw, 11.07. Let me look at the chart again: 11.07. The result we got for our statistic is even less likely than that; the probability is less than our significance level. So then we are going to reject. Let me put it this way: 11.44 is more extreme than our critical chi-square value. So it’s very unlikely that this distribution is true. So we will reject what he’s telling us. We will reject this distribution. It’s not a good fit based on this significance level.
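For reference, SciPy’s built-in goodness-of-fit test does the whole calculation in one call (again continuing the variables from the sketch above). The p-value it reports, roughly 0.043, is below the 5% significance level, which is the same conclusion as comparing 11.44 to the critical value of 11.07:

```python
from scipy.stats import chisquare

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 2), round(p_value, 4))  # 11.44 0.0433

if p_value < alpha:
    print("Reject the owner's distribution: not a good fit at the 5% level.")
else:
    print("Fail to reject the owner's distribution.")
```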

62 thoughts on “Pearson’s chi square test (goodness of fit) | Probability and Statistics | Khan Academy”

  1. Hypothetically, the null hypothesis and the alternative hypothesis could be switched, right? If not, how do you know which one is the null hypothesis and which one is the alternative hypothesis?

  2. Please correct if I am wrong but it will be a two tailed tests and the P value to consider would be for 0.975% (12.83). So as 11.44 is within 12.83, we would accept the hypothesis.

  3. Excellent thank you GK. \ For myself only \ X is a multinomial RV with 6 events with p_1 = .1, p_2 = .1, etc. What is the probability P(X1=30, X2=14, etc) = 200! / (30! 14! etc) p_1^30 p_2^14 … \ this will be an extremely low #. Also, is the shop owner's distribution really a multinomial? That would be if, like, 200 people each independently chose what day to go. hmmm… to come back to later

  4. Man, Sal, i thought i was done with khanacademy in grade 12 and here i am in my final semester of engineering :')

  5. By 'very unlikely' you mean that there is only a slightly higher than 5% chance that the null hypothesis is untrue

  6. Thanks a lot for the explanation. What would happen if the random variables were not independent? Would it still follow Chi square distribution? How would it affect the degree of freedom?

  7. You don't have to do the chi square test to see that for 200 customers a week is not worth buying :)) just kidding, congrats for the video!

  8. What is the t value you enter when creating a table? And the a = ? Is a = .05? So a chi square test was conducted for goodness of fit (a = ??)

  9. how can you know the last data point from the others?! I mean it is a data point, not the next point in a function
    as I remember the n-1 is used so you don't overestimate, which I don't really get to be honest

  10. How do you know the distribution of the samples follows the chi square distribution? If it is just for this example, then how would you find out that the samples follow the chi square distribution?

  11. Your video is good, but I think that you can change it to a table and show them the working on how you derive the calculation. That is what my lecturer taught me. Thanks

  12. Currently studying for the USMLE STEP 3, which is heavy on biostats. As a physician, I will NEVER USE THIS KNOWLEDGE! WHY MUST I LEARN IT!?!?! :((((

  13. This example assumes that the confidence level is 95% and an alpha of 0.05 or (1 – .95), that is the risk of making a Type 1 error (a false positive – rejecting the null hypothesis when it shouldn't have been rejected). As the vid shows, the chi-square statistic of 11.44 > critical value of 11.07, so we reject the null hypothesis and accept the alternative hypothesis. We get the same result by looking at p of 0.04 < alpha of 0.05. Imagine the example used a confidence level of 99% instead of 95%: we'd have an alpha of 0.01 instead of 0.05. With an alpha of 0.01, the chi-square statistic would still be 11.44, but it would be less than the new critical value of 15.09, so we would fail to reject the null hypothesis. Again, we would get the same result by looking at p of 0.04 > alpha of 0.01.

  14. Don't forget everyone! The chi-squared statistic is actually a measure of badness of fit, not goodness of fit, like it would seem obvious to be.

  15. One thing I don't entirely understand is why do we almost always choose a significance level of 5%? And what exactly would we gain or lose if we make it more or less than 5%? If we make it higher than 5%, doesn't it then make the hypothesis more difficult to accept, since it would require more accuracy to accept it? This would seem to make it more easy to reject a true hypothesis, but at the same time, doesn't it also make it more likely that we won't accept a false hypothesis?

    Edit: I read that medical experiments have a tendency to choose a significance level of 1%, but wouldn't it be safer to choose a higher significance level, so that a false hypothesis won't be accepted, or did I misunderstand something?

  16. why does he have two rows of expected percent? seems that he ignored the first one for some reason..

  17. Hi! thanks for the amazing content. Just a little question on the assumed null hypothesis: don't we usually take the "NO news" kind of thing in null. I mean would it be more conventional if we state the null hypothesis to be: "Owner's distribution is incorrect"

  18. so the p-value would be 0.043293 which is < 0.05 so reject the null hypothesis – same result as using critical values 11.07 < 11.44.

  19. So if our calculated number was like, 9.something, then it being smaller than 11.07 means we accept the null hypothesis???
