Statistics | ANOVA

While a t-test can be used to test whether the means of two groups are significantly different from each other, ANOVA (analysis of variance) can be used to test whether there is a significant difference between the means of three or more groups.
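
As a rough illustration (a minimal sketch on made-up data, not the FPL data used below), the two-group comparison maps to t.test() and the multi-group comparison to aov() in R:

# Toy data: a numeric response y measured in three groups A, B and C
set.seed(42)
toy <- data.frame(y = c(rnorm(10, 5), rnorm(10, 6), rnorm(10, 7)),
                  g = rep(c("A", "B", "C"), each = 10))

# Two groups: a t-test compares a pair of means
t.test(y ~ g, data = subset(toy, g %in% c("A", "B")))

# Three or more groups: an ANOVA compares all the group means at once
summary(aov(y ~ g, data = toy))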

Performing an ANOVA test involves making some assumptions about the data: the observations are independent, the data in each group are approximately normally distributed, and the groups have approximately equal variances. The last two assumptions are checked in steps 2 and 3 below.

ANOVA: Which Position Scores the Most Points in FPL?

If we had a hypothesis that ‘On average, there is a significant difference between the expected points (xP) of players in each position in FPL’, we could test whether this is true using an ANOVA.

1. Preparing Data

First, let’s load the packages we need and download the player data for all 38 gameweeks of the 2023-24 season.

# Load packages
library(tidyverse)
library(ggsignif)
library(car)

# Set theme
theme <- theme_classic()+ theme(text = element_text(size = 12, family='Radio Canada Big'))

# Get data for players who have played at least 1 minute in each week
github <- 'https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2023-24/'
filenames <- sprintf("gw%s", 1:38)
PlayersList <- lapply(filenames, function(x) {
  data <- subset(read_csv(url(paste0(github, "gws/",x,".csv"))), minutes>0)
  return(data)}) 
Players <- data.frame(data.table::rbindlist(PlayersList))

From the downloaded data, we can select a sample to perform the ANOVA test on.

# Filter for players that played more than 70 mins and select 10 players in each position
set.seed(1)
Players70 <- Players %>% filter(minutes>70) %>% 
  select(name, position, xP) %>% 
  group_by(position) %>%
  slice_sample(n=10) %>%
  as.data.frame()
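
A quick check (just a sketch using base R’s table()) confirms we really do have 10 sampled players in each position:

# Count sampled players per position - should be 10 in each group
table(Players70$position)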

We can visualise the sample using a box plot.

# Box plot
ggplot(Players70, aes(x=position, y=xP, fill=position))+
  geom_boxplot()+
  theme+
  scale_fill_manual(values=c('#a9c279', '#f04d23', '#719fc0', '#f6eb74'))+
  xlab('Position')+ ylab('xP')+ guides(fill="none")

The box plot indicates that, whilst the median xP values for each position are quite similar, the spread of points differs between positions.

2. Checking for Normality

Now we need to check whether our data is normally distributed, as this is an assumption of ANOVA. We can use a QQ plot to see whether the actual xP values reflect the theoretical quantiles predicted from a normal distribution.

# Does data follow normal distribution: QQ plot
ggplot(Players70, aes(sample=xP)) + 
  geom_qq_line() +
  stat_qq() +
  theme + 
  labs(x='Theoretical xP', y='Actual xP')

From the QQ plot it appears that this data is normally distributed. We can also use a Shapiro-Wilk test to check whether the data are normal.

# Does data follow normal distribution: Shapiro-Wilk test
shapiro.test(Players70$xP)
## 
##  Shapiro-Wilk normality test
## 
## data:  Players70$xP
## W = 0.94719, p-value = 0.06074

The test is just non-significant (p > 0.05), so we cannot reject the null hypothesis that the data are normally distributed, and we can proceed with an ANOVA.
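
Strictly speaking, the normality assumption applies within each group (or to the model residuals), so a supplementary per-position check can be run as well; a quick sketch using dplyr:

# Shapiro-Wilk test within each position
Players70 %>%
  group_by(position) %>%
  summarise(p.value = shapiro.test(xP)$p.value)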

3. Checking for Equal Variance

Something else we need to check is whether our four groups have approximately equal variance. This can be tested using Levene’s test.

# Does data for each position have equal variance: Levene's test
Players70$position <- as.factor(Players70$position)
leveneTest(xP ~ position, Players70)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  3  1.8141  0.162
##       36

Levene’s test returns a non-significant result, meaning we cannot reject the null hypothesis that the variances are equal, so we can proceed on the assumption that the groups have approximately equal variance.
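
Base R also offers bartlett.test() as an alternative homogeneity-of-variance check (note it is more sensitive to departures from normality than Levene’s test); a sketch, assuming the same Players70 data:

# Alternative homogeneity-of-variance check: Bartlett's test
bartlett.test(xP ~ position, data = Players70)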

4. SSW: Sum of Squares Within Groups

To perform the first step of an ANOVA test, the data needs to be ‘widened’. Here is what the data looks like now:

knitr::kable(head(Players70))
name                      position     xP
------------------------  ---------  ----
Joël Veltman              DEF         1.2
Matty Cash                DEF         3.3
Chris Mepham              DEF         3.0
Trent Alexander-Arnold    DEF         5.2
Toti António Gomes        DEF         4.5
Connor Roberts            DEF         3.0

If we pivot the table it looks like this:

# Set indices for each group of 10
Players70$index <- rep(1:10,4)

# Pivot table
Players70PW <- Players70 %>% pivot_wider(id_cols=!name, names_from='position', values_from='xP')
knitr::kable(head(Players70PW))
 index    DEF   FWD    GK    MID
------  -----  ----  ----  -----
     1    1.2   5.3   3.7    2.1
     2    3.3   3.5   2.7    1.8
     3    3.0   5.0   2.0   11.0
     4    5.2   8.1   3.2    1.5
     5    4.5   2.7   8.5    5.7
     6    3.0   4.5   0.7    6.3

With the pivoted table, we can calculate the squared difference from the mean of each group for each value:

# Square differences between group values and the mean of each group
Players70PW$DEFDiffSq <- (Players70PW$DEF - mean(Players70PW$DEF))**2
Players70PW$FWDDiffSq <- (Players70PW$FWD - mean(Players70PW$FWD))**2
Players70PW$GKDiffSq <- (Players70PW$GK - mean(Players70PW$GK))**2
Players70PW$MIDDiffSq <- (Players70PW$MID - mean(Players70PW$MID))**2
knitr::kable(head(Players70PW))
 index    DEF   FWD    GK    MID   DEFDiffSq   FWDDiffSq   GKDiffSq   MIDDiffSq
------  -----  ----  ----  -----  ----------  ----------  ---------  ----------
     1    1.2   5.3   3.7    2.1      2.2201      0.5929     0.1089      5.9049
     2    3.3   3.5   2.7    1.8      0.3721      1.0609     0.4489      7.4529
     3    3.0   5.0   2.0   11.0      0.0961      0.2209     1.8769     41.8609
     4    5.2   8.1   3.2    1.5      6.3001     12.7449     0.0289      9.1809
     5    4.5   2.7   8.5    5.7      3.2761      3.3489    26.3169      1.3689
     6    3.0   4.5   0.7    6.3      0.0961      0.0009     7.1289      3.1329

Next, we sum all of these squared differences to get the SSW:

# Sum the values of the newly created columns
ssw <- sum(Players70PW$DEFDiffSq)+ sum(Players70PW$FWDDiffSq)+ sum(Players70PW$GKDiffSq)+ sum(Players70PW$MIDDiffSq)
message("SSW: ", ssw)
## SSW: 179.772
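
The same SSW can be obtained directly from the long-format data without pivoting; a minimal dplyr sketch, assuming Players70 from step 1 is still in memory:

# SSW from the long data: sum of squared deviations from each group's mean
Players70 %>%
  group_by(position) %>%
  summarise(ss = sum((xP - mean(xP))^2)) %>%
  summarise(ssw = sum(ss))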

5. SST: Total Sum of Squares

First we stack all 40 xP values into a single vector, then find the squared difference of each value from the overall (grand) mean.

# Squared difference of each sample
Players70PWStacked <- stack(Players70PW[, 2:5])$values
Players70PWDiffSq <- (mean(Players70PWStacked) - Players70PWStacked)**2
sst <- sum(Players70PWDiffSq)
message("SST: ", sst)
## SST: 204.584
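
Equivalently, SST can be computed straight from the long-format data as the squared deviation of every xP value from the grand mean; a small sketch:

# SST from the long data: squared deviation of each value from the grand mean
sum((Players70$xP - mean(Players70$xP))^2)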

6. SSB: Sum of Squares Between Groups

SSB is just SST - SSW:

ssb <- sst - ssw
message("SSB: ", ssb)
## SSB: 24.812
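
As a cross-check, SSB can also be computed directly as each group’s size multiplied by the squared distance of its mean from the grand mean; a dplyr sketch:

# SSB computed directly from the group means
grandMean <- mean(Players70$xP)
Players70 %>%
  group_by(position) %>%
  summarise(n = n(), groupMean = mean(xP)) %>%
  summarise(ssb = sum(n * (groupMean - grandMean)^2))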

7. Degrees of Freedom

We can set up an ANOVA table to display our results, including the degrees of freedom for each calculation:

# Create dataframe
ANOVA <- data.frame(Source=c('Group', 'Error', 'Total'), SS=c(ssb, ssw, sst), DF=c(4-1, 40-4, 40-1))

# Calculate MSB (mean square between groups) & MSE (mean square error, within groups)
ANOVA$MS <- c(ssb/(4-1), ssw/(40-4), NA)

# Calculate f-statistic
ANOVA$FStat <- c((ssb/(4-1))/(ssw/(40-4)),NA,NA)

# Get critical value
criticalValue <- qf(p = 0.05, df1 = 4-1, df2 = 40-4, lower.tail = FALSE)

# Critical value & F-statistic
message('Critical value: ', criticalValue, '\nF-statistic: ', ANOVA$FStat[1])
## Critical value: 2.86626555094018
## F-statistic: 1.65623122621988
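
For reference, the F-statistic computed above is the ratio of the between-group and within-group mean squares, with k = 4 positions and N = 40 players:

$$F = \frac{MSB}{MSE} = \frac{SSB/(k-1)}{SSW/(N-k)} = \frac{24.812/3}{179.772/36} \approx 1.656$$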

The F-statistic is below the critical value, meaning we cannot reject the null hypothesis that the mean xP is the same for all four positions. We can also obtain a p-value for this comparison:

# P-value
ANOVA$Pval <- pf(ANOVA$FStat, 4-1, 40-4, lower.tail = FALSE)
knitr::kable(ANOVA)
Source        SS   DF         MS      FStat        Pval
-------  -------  ---  ---------  ---------  ----------
Group     24.812    3   8.270667   1.656231   0.1936809
Error    179.772   36   4.993667         NA          NA
Total    204.584   39         NA         NA          NA

A non-significant p-value (0.194 > 0.05) also confirms that we cannot reject the null hypothesis.
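
As a sanity check on the hand-built table, the same ANOVA can be run with R’s built-in aov() function; a minimal sketch, assuming Players70 is still in memory (the Df, Sum Sq, F value and Pr(>F) columns should match the table above):

# One-way ANOVA using base R, to cross-check the manual calculation
summary(aov(xP ~ position, data = Players70))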

8. Power

Only sampling 10 players from each position probably limits the power of this analysis. If we were to sample a greater number of players per group, we might obtain a more reliable result. However, with a larger sample the Shapiro-Wilk test would be more likely to detect that the underlying xP distribution is not normal, violating an assumption of ANOVA, and we would have to use a non-parametric alternative such as the Kruskal–Wallis test.
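
To make the point about sample size concrete, base R’s power.anova.test() can estimate how many players per position would be needed, and kruskal.test() is the non-parametric alternative mentioned above. A rough sketch, assuming Players70 and ssw from the earlier steps are still available:

# Approximate sample size per position for 80% power,
# using the observed group means and within-group variance
groupMeans <- tapply(Players70$xP, Players70$position, mean)
power.anova.test(groups = length(groupMeans),
                 between.var = var(groupMeans),
                 within.var = ssw / (40 - 4),
                 power = 0.8)

# Non-parametric alternative: Kruskal-Wallis rank sum test
kruskal.test(xP ~ position, data = Players70)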