I’m thinking about

buying a restaurant, so I go and ask

the current owner, what is the distribution

of the number of customers you get each day? And he says, oh, I’ve

already figured that out. And he gives me

this distribution over here, which essentially

says 10% of his customers come in on Monday, 10% on

Tuesday, 15% on Wednesday, so forth, and so on. They’re closed on Sunday. So this is 100% of the

customers for a week. If you add that

up, you get 100%. I obviously am a

little bit suspicious, so I decide to see how good

this distribution that he’s describing actually

fits observed data. So I actually observe the number

of customers, when they come in during the week,

and this is what I get from my observed data. So to figure out whether

I want to accept or reject his hypothesis right

here, I’m going to do a little bit

of a hypothesis test. So I’ll make the null hypothesis

that the owner’s distribution– so that’s this thing

right here– is correct. And then the

alternative hypothesis is going to be that

it is not correct, that it is not a

correct distribution, that I should not feel

reasonably OK relying on this. It’s not the correct–

I should reject the owner’s distribution. And I want to do this with

a significance level of 5%. Or another way of

thinking about it, I’m going to calculate a

statistic based on this data right here. And it’s going to be

a chi-square statistic. Or another way to view

it is that the statistic that I’m going to

calculate has approximately a chi-square distribution. And given that it does have

a chi-square distribution with a certain number

of degrees of freedom and we’re going to calculate

that, what I want to see is whether the probability of

getting this result, or getting a result like

this or a result more extreme, is less than 5%. If the probability of getting

a result like this or something less likely than

this is less than 5%, then I’m going to reject

the null hypothesis, which is essentially just rejecting

the owner’s distribution. If I don’t get

that, if I say, hey, the probability of getting

a chi-square statistic that is this extreme or more

is greater than my alpha, than my significance level,

then I’m not going to reject it. I’m going to say,

well, I have no reason to really assume

that he’s lying. So let’s do that. So to calculate the chi-square

statistic, what I’m going to do is– so here we’re assuming

the owner’s distribution is correct. So assuming the

owner’s distribution was correct, what would have

been the expected observed? So we have expected

percentage here, but what would have been

the expected observed? So let me write this right here. Expected. I’ll add another row, Expected. So we would have expected

10% of the total customers in that week to

come in on Monday, 10% of the total

customers of that week to come in on Tuesday, 15%

to come in on Wednesday. Now to figure out what

the actual number is, we need to figure out the

total number of customers. So let’s add up these

numbers right here. So we have– I’ll get

the calculator out. So we have 30 plus 14 plus

34 plus 45 plus 57 plus 20. So there’s a total

of 200 customers who came into the

restaurant that week. So let me write this down. So this is equal to– so I

wrote the total over here. Ignore this right here. I had 200 customers

come in for the week. So what was the expected

number on Monday? Well, on Monday, we would

have expected 10% of the 200 to come in. So this would have been 20

customers, 10% times 200. On Tuesday, another 10%. So we would have

expected 20 customers. Wednesday, 15% of 200,

that’s 30 customers. On Thursday, we would have

expected 20% of 200 customers, so that would have

been 40 customers. Then on Friday, 30%, that

would have been 60 customers. And then on Saturday, 15% again. 15% of 200 would have

been 30 customers. So if this distribution

is correct, this is the actual number

that I would have expected. Now to calculate the

chi-square statistic, we essentially just take–

let me just show it to you, and instead of

writing chi, I’m going to write capital X squared. Sometimes someone will write the

actual Greek letter chi here. But I’ll write the

x squared here. And let me write it this way. This is our

chi-square statistic, but I’m going to write it with

a capital X instead of a chi because this is going

to have approximately a chi-squared distribution. I can’t assume

that it’s exactly, so this is where we’re dealing

with approximations right here. But it’s fairly

straightforward to calculate. For each of the days,

we take the difference between the observed

and expected. So it’s going to

be 30 minus 20– I’ll do the first one

color coded– squared divided by the expected. So we’re essentially

taking the square of what you could almost call

the error between what we observed and expected, or

the difference between what we observed and expected, and

we’re kind of normalizing it by the expected right over here. But we want to take the

sum of all of these. So I’ll just do all

of those in yellow. So plus 14 minus 20 squared

over 20 plus 34 minus 30 squared over 30 plus– I’ll continue

over here– 45 minus 40 squared over 40 plus 57 minus

60 squared over 60, and then finally, plus 20

minus 30 squared over 30. I just took the observed

minus the expected squared over the expected. I took the sum of

it, and this is what gives us our

chi-square statistic. Now let’s just calculate what

this number is going to be. So this is going to be equal

to– I’ll do it over here so I don’t run out of space. So we’ll do this in a new color. We’ll do it in orange. This is going to be

equal to 30 minus 20 is 10 squared, which is 100

divided by 20, which is 5. I might not be able to do all

of them in my head like this. Plus, actually, let me

just write it this way just so you can

see what I’m doing. This right here is 100

over 20 plus 14 minus 20 is negative 6 squared

is positive 36. So plus 36 over 20. Plus 34 minus 30 is

4, squared is 16. So plus 16 over 30. Plus 45 minus 40

is 5 squared is 25. So plus 25 over 40. Plus the difference

here is 3 squared is 9, so it’s 9 over 60. Plus we have a difference of

10 squared is plus 100 over 30. And this is equal to– and I’ll

just get the calculator out for this– this is

equal to, we have 100 divided by 20

plus 36 divided by 20 plus 16 divided by 30

plus 25 divided by 40 plus 9 divided by 60 plus 100

divided by 30 gives us 11.44. So let me write that down. So this right here

is going to be 11.44. This is my chi-square

statistic, or we could call it a big

capital X squared. Sometimes you’ll have it

written as a chi-square, but this statistic is

going to have approximately a chi-square distribution. Anyway, with that

said, let’s figure out, if we assume that it has roughly

a chi-square distribution, what is the probability of getting a

result this extreme or at least this extreme, I guess is another

way of thinking about it. Or another way of saying, is

this a more extreme result than the critical

chi-square value that there’s a 5% chance of

getting a result that extreme? So let’s do it that way. Let’s figure out the

critical chi-square value. And if this is more

extreme than that, then we will reject

our null hypothesis. So let’s figure out our

critical chi-square values. So we have an alpha of 5%. And actually the other

thing we have to figure out is the degrees of freedom. The degrees of freedom, we’re

taking one, two, three, four, five, six sums, so

you might be tempted to say the degrees

of freedom are six. But one thing to

realize is that if you had all of this

information over here, you could actually figure out

this last piece of information, so you actually have

five degrees of freedom. When you have just kind of

n data points like this, and you’re measuring kind of

the observed versus expected, your degrees of freedom

are going to be n minus 1, because you could figure

out that nth data point just based on everything

else that you have, all of the other information. So our degrees of freedom

here are going to be 5. It’s n minus 1. So our significance level is 5%. And our degrees of freedom is

also going to be equal to 5. So let’s look at our

chi-square distribution. We have a degree

of freedom of 5. We have a significance

level of 5%. And so the critical

chi-square value is 11.07. So let’s go with this chart. So we have a

chi-squared distribution with a degree of freedom of 5. So that’s this distribution

over here in magenta. And we care about a

critical value of 11.07. So this is right here. Oh, you actually even

can’t see it on this. So if I were to keep drawing

this magenta thing all the way over here, if the

magenta line just kept going, over here, you’d have 8. Over here you’d have 10. Over here, you’d have 12. 11.07 is maybe some

place right over there. So what it’s saying

is the probability of getting a result at least

as extreme as 11.07 is 5%. So we could write it even here. Our critical chi-square value is

equal to– we just saw– 11.07. Let me look at the chart again. 11.07. The result we got

for our statistic is even less likely than that. The probability is less

than our significance level. So then we are going to reject. So the probability

of getting that is– let me put it this

way– 11.44 is more extreme than our

critical chi-square level. So a result like this would be very unlikely if

this distribution were true. So we will reject

what he’s telling us. We will reject

this distribution. It’s not a good fit based

on this significance level.
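The arithmetic in the walkthrough above can be reproduced in a few lines. This is a minimal sketch, not part of the video: the variable names are made up for illustration, and the critical value of 11.07 is simply copied from the chi-square table used in the transcript rather than computed.

```python
# Observed customer counts from the transcript, Monday through Saturday.
observed = [30, 14, 34, 45, 57, 20]
# The owner's claimed share of weekly customers for the same days, in percent.
claimed_percent = [10, 10, 15, 20, 30, 15]

total = sum(observed)  # total customers observed in the week
expected = [pct * total / 100 for pct in claimed_percent]

# Chi-square statistic: sum over days of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Once the total and five of the counts are known, the sixth is determined,
# so there are n - 1 degrees of freedom.
degrees_of_freedom = len(observed) - 1

# Assumption: critical value taken from the transcript's table
# (significance level 5%, 5 degrees of freedom).
CRITICAL_VALUE = 11.07

reject_null = statistic > CRITICAL_VALUE
print(f"total = {total}, expected = {expected}")
print(f"statistic = {statistic:.2f}, df = {degrees_of_freedom}, reject = {reject_null}")
```

Working from integer percentages keeps the expected counts exact, which makes the intermediate values easy to compare against the numbers read out in the transcript.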

Hypothetically, the null hypothesis and the alternative hypothesis could be switched, right? If not, how do you know which one is the null hypothesis and which one is the alternative hypothesis?

where did he get the 5 from for the N-1 degrees of freedom?

exam in 2 hours, pray for me #uqadvantage

Why does the X² test divide by the expected value and not the variance?

why do you reject the hypothesis in the end? Can somebody help me out please?

What software is used in the video for the drawing?

Why do we have to take a 5% significance level?

thank you so much

is there another video like this where it shows the formulas for each step? thanks

Well explained, thank you!

Thank god for you khan academy.

that was awesome

test in 2 hours

What is the chi square test? Application of chi square test in genetic research?

Please correct if I am wrong but it will be a two tailed tests and the P value to consider would be for 0.975% (12.83). So as 11.44 is within 12.83, we would accept the hypothesis.

Excellent, thank you GK. For myself only: X is a multinomial RV with 6 events with p_1 = .1, p_2 = .1, etc. What is the probability P(X1=30, X2=14, etc) = 200! / 30!14! etc p_1 ^ 30 p_2 … This will be an extremely low #. Also, is the shop owner's distribution really a multinomial? That would be if like 200 people each independently chose what day to go. hmmm… to come back to later
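The probability asked about in the comment above can be evaluated directly. This is a rough sketch that assumes the multinomial model the commenter describes (200 independent customers, each picking a day with the owner's stated probabilities); the variable names are illustrative only. Logarithms of the factorials are used so the large factorials never have to be formed.

```python
import math

# Observed counts and the owner's claimed probabilities, Monday through Saturday.
observed = [30, 14, 34, 45, 57, 20]
probs = [0.10, 0.10, 0.15, 0.20, 0.30, 0.15]

n = sum(observed)

# log of n! / (x_1! * ... * x_6!) using lgamma(k + 1) = log(k!)
log_coefficient = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in observed)
# log of p_1^x_1 * ... * p_6^x_6
log_powers = sum(k * math.log(p) for k, p in zip(observed, probs))

prob = math.exp(log_coefficient + log_powers)
print(f"P(exactly these counts) = {prob:.3e}")
```

Any single exact outcome is improbable under this model, even the most likely one, which is one reason the test looks at the probability of a statistic at least as extreme rather than at the probability of the exact counts.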

The observed values don't add up to 100?

It is very unfair to dislike this tutorial.

Over 7 years and this is still very useful 🙂

thanks for the good video 🙂

Rejuct

if they teach like that in schools, we would all be statisticians.

Man, Sal, i thought i was done with khanacademy in grade 12 and here i am in my final semester of engineering :')

i love u khan

When the exam is in 2.5 hrs

Trying to study before the exam ^^

By 'very unlikely' you mean that there is only a slightly higher than 5% chance that the null hypothesis is untrue

Thanks a lot for the explanation. What would happen if the random variables were not independent? Would it still follow Chi square distribution? How would it affect the degree of freedom?

Does anyone know where Sal gets his calculator? thx

THANK YOU :* video is comprehensible 🙂

You don't have to do the chi square test to see that for 200 customers a week is not worth buying :)) just kidding, congrats for the video!

Nice. Help me in analysing research data

What is the t value u enter when creating a table. And the a= . Is a=.05. So a chi square test was conducted for goodness of fit (a=??)

how can you know the last datapoint from the others?! i mean it is a datapoint not the next point in a function

as I remember the n-1 is used so you don't overestimate, which I don't really get to be honest

How come he got a degree of 5 instead of 6?

The selection of alpha is not clear

How do you know the distribution of the samples follows the chi square distribution? If it is just for the example, then how would you find out the samples follow the chi square distribution?

Thank you!

amazing

Your video is good, but i think that you can change it to table and show them the working on how you derive the calculation. That is what my lecturer taught me. Thanks

Currently studying for the USMLE STEP 3, which is heavy on biostats. As a physician, I will NEVER USE THIS KNOWLEDGE! WHY MUST I LEARN IT!?!?! :((((

Just Wow 🙂

This example assumes that the confidence level is 95% and an alpha of 0.05 or (1 – .95), that is the risk of making a Type 1 error (a false positive – rejecting the null hypothesis when it shouldn't have been rejected). As the vid shows, the chi-square statistic of 11.44 > critical value of 11.07, so we reject the null hypothesis and accept the alternative hypothesis. We get the same result by looking at p of 0.04 < alpha of 0.05. Imagine the example used a confidence level of 99% instead of 95%, we'd have an alpha of 0.01 instead of 0.05. With an alpha of 0.01, the chi-square statistic would still be 11.44, but it would be < a new critical value of 15.09, so we would fail to reject the null hypothesis. Again, we would get the same result by looking at p of 0.04 > alpha of 0.01.
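The figures in the comment above can be checked numerically. What follows is a sketch using only the Python standard library; `chi2_sf` and `chi2_critical` are hypothetical helper names written for this example, not functions from any package. The tail probability comes from the standard series for the lower incomplete gamma function, and the critical value is found by bisection on that tail probability.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, y = df / 2, x / 2
    # Series: gamma_lower(a, y) = y^a * e^(-y) * sum_n y^n / (a * (a+1) * ... * (a+n))
    term = 1.0 / a
    series = term
    n = 1
    while term > 1e-16 * series:
        term *= y / (a + n)
        series += term
        n += 1
    lower = math.exp(a * math.log(y) - y - math.lgamma(a)) * series
    return 1.0 - lower

def chi2_critical(alpha, df, lo=0.0, hi=100.0):
    """Value c with P(X >= c) = alpha, found by bisection (assumes c lies below hi)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > alpha:
            lo = mid  # tail still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2

statistic = 11.44  # value computed in the video
df = 5

p_value = chi2_sf(statistic, df)
critical_05 = chi2_critical(0.05, df)
critical_01 = chi2_critical(0.01, df)

print(f"p-value = {p_value:.4f}")
print(f"critical value at alpha 0.05 = {critical_05:.2f}, reject: {statistic > critical_05}")
print(f"critical value at alpha 0.01 = {critical_01:.2f}, reject: {statistic > critical_01}")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision, so both lines should always agree for a given alpha.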

so if the chi squared statistic i got from the sum is smaller than that from the table, what does that mean?

And i get it faster than the lectures earlier 🤯

Dude, you a genius!!

Thanks for this!! My exams are in 3 days time.

saturdaysuperb video

Don't forget everyone! The chi-squared statistic is actually a measure of badness of fit, not goodness of fit, like it would seem obvious to be.

my brain hurts

hopefully the complete problem will be stated.. thank you

What’s the P value

AP test tomorrow :((

Good luck students taking the AP statistics tmr! Pray for me guys

One thing I don't entirely understand is why do we almost always choose a significance level of 5%? And what exactly would we gain or lose if we make it more or less than 5%? If we make it higher than 5%, doesn't it then make the hypothesis more difficult to accept, since it would require more accuracy to accept it? This would seem to make it more easy to reject a true hypothesis, but at the same time, doesn't it also make it more likely that we won't accept a false hypothesis?

Edit: I read that medical experiments have a tendency to choose a significance level of 1%, but wouldn't it be safer to choose a higher significance level, so that a false hypothesis won't be accepted, or did I misunderstand something?

why does he have two rows of expected percent? seems that he ignored the first one for some reason..

Thank you very much…..excellent explanation

Hi! thanks for the amazing content. Just a little question on the assumed null hypothesis: don't we usually take the "NO news" kind of thing in null. I mean would it be more conventional if we state the null hypothesis to be: "Owner's distribution is incorrect"

so the p-value would be 0.043293 which is < 0.05 so reject the null hypothesis – same result as using critical values 11.07 < 11.44.

Thank you

Thank youuuu it really helped 👍🏻💙

So if our calculated number was like, 9.something, then it being smaller than 11.07 means we accept the null hypothesis???