Let's say we have a survey in which students state whether they liked a class and whether they passed or failed it. Now, we want to test whether these two variables are independent. That is to say, we want to know whether a student's opinion of the class depends on the grade, and vice versa. In this article, we will learn how to test the independence of categorical variables.
Let's start by simulating the survey. We will have 20 students saying whether they enjoyed the class and whether they passed or failed.
enj.sample = sample(c("Yes", "No"), 20, replace = TRUE)
enjoyed = factor(enj.sample)
pass.sample = sample(c("Pass", "Fail"), 20, replace = TRUE)
passed = factor(pass.sample)
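Because sample draws at random, each run of the code above produces a different survey. If you want a repeatable simulation, one option is to fix the random seed first. The following is a minimal sketch; the seed value 123 and the names first.draw and second.draw are arbitrary choices for illustration, and this seed is not the one that produced the output shown later in this article.

```r
# Fix the seed so the random draws are the same on every run.
# The value 123 is an arbitrary, illustrative choice.
set.seed(123)
first.draw = sample(c("Yes", "No"), 20, replace = TRUE)

# Resetting the same seed reproduces exactly the same draws.
set.seed(123)
second.draw = sample(c("Yes", "No"), 20, replace = TRUE)

identical(first.draw, second.draw)
```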
To test for independence, we can use the chi-squared test. To do this in R, we first use the table function to create a contingency table, then call summary on it. This gives us the chi-squared statistic, its degrees of freedom, and a p-value.
cont.table = table(enjoyed, passed)
summary(cont.table)
# Number of cases in table: 20
# Number of factors: 2
# Test for independence of all factors:
# Chisq = 0.03472, df = 1, p-value = 0.8522
# Chi-squared approximation may be incorrect
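From the output we see that the p-value is 0.8522, far above the conventional 0.05 significance level, so we fail to reject the null hypothesis that the two factors are independent. This does not prove that they are independent, but the data give no evidence of an association, which makes sense because we generated the two variables randomly and separately. The note that the chi-squared approximation may be incorrect appears because, with only 20 students, some expected cell counts are small.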
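To see where the statistic comes from, it can help to compute it by hand. The sketch below uses a small table of hypothetical counts (not the simulated data above) and applies the Pearson formula: the sum over all cells of (observed - expected)^2 / expected, where the expected count for a cell is its row total times its column total divided by the number of cases. The variable names are illustrative assumptions.

```r
# Hypothetical 2x2 counts for illustration (not the article's simulated data).
observed = as.table(matrix(c(5, 4, 6, 5), nrow = 2,
                           dimnames = list(enjoyed = c("No", "Yes"),
                                           passed = c("Fail", "Pass"))))
n = sum(observed)

# Expected count under independence: row total * column total / n.
expected = outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom, and p-value.
chisq.stat = sum((observed - expected)^2 / expected)
deg.free = (nrow(observed) - 1) * (ncol(observed) - 1)
p.value = pchisq(chisq.stat, deg.free, lower.tail = FALSE)

# Are any expected counts below 5? Small counts make the approximation shaky.
any(expected < 5)

# Cross-check against summary(); chisq.test() agrees as well once the
# Yates continuity correction is switched off.
summary(observed)
suppressWarnings(chisq.test(observed, correct = FALSE))
```

Note that chisq.test applies a continuity correction to 2x2 tables by default, so its statistic differs from the one reported by summary unless correct = FALSE is passed.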