Statistics | Logistic Regression

Logistic regression can be applied to categorical variables that are associated with different predictors: these predictors may be other categorical variables or continuous variables. A logistic regression model can be applied to assess the effect of each predictor on the probability that each data point belongs to a specific category - this can then be used to predict which category each data point in a set of unlabeled will fall into.

As an example, various metrics like xG or creativity, could be used to predict what position a player is classified as in the Premier League.

Opening Packages & Downloading Data

Before trying to apply a logistic regression model, I need to load the packages I want to use (ggplot2 & dplyr, which are part of the tidyverse package), and download some data from vaastav’s Fantasy-Premier-League GitHub page.

The data I will use as a training set is from the 2023-24 season - I can then use the logistic regression model trained on this data to predict the position of players in the 2024-25 season using their xGI data.

# Packages
library(tidyverse)
library(extrafont)
library(ggrepel)
library(pROC)
github <- "https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2023-24/"
filenames <- sprintf("gw%s",seq(1:38))
PlayersList <- lapply(filenames, function(x) {
  data <- subset(read_csv(url(paste0(github, "gws/",x,".csv"))), minutes>0)
  return(data)}) 
PlayerData <- data.frame(data.table::rbindlist(PlayersList))

Preparing the Data

Here I create two groups of players: attackers (MID & FWD position) and defenders (DEF). Then I filter out GKs, and summarise the total xGI for each player across the entire season.

# Replace FWD/MID with ATT
PlayerData$position <- gsub('MID', 'ATT', PlayerData$position)
PlayerData$position <- gsub('FWD', 'ATT', PlayerData$position)

# Select relevant columns
PlayerPos <- PlayerData %>% filter(minutes>60) %>% 
  filter(position!='GK') %>%
  select(name, position, expected_goal_involvements) %>% 
  group_by(name, position) %>% 
  summarise(total_xGI = sum(expected_goal_involvements))

The plot below shows the xGI and position for each player: in general, attackers have a higher xGI, as expected.

Logistic Regression

Next I can factorise the data labels (player position), and create a logistic regression model of position & xGI.

# Turn categorical data into factors
PlayerPos$position <- as.factor(PlayerPos$position)

# Logistic regression model of sex, age, heteroplasmy on binary phenotype
logisticModel <- glm(position ~ total_xGI, family = binomial(), PlayerPos)

# Results of regression
summary(logisticModel)

# Get OR from coefficient
(1-exp(-0.26462))*100
## 
## Call:
## glm(formula = position ~ total_xGI, family = binomial(), data = PlayerPos)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.35083    0.14215   2.468   0.0136 *  
## total_xGI   -0.26462    0.04222  -6.267 3.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 593.45  on 441  degrees of freedom
## Residual deviance: 526.94  on 440  degrees of freedom
## AIC: 530.94
## 
## Number of Fisher Scoring iterations: 5
## 
## [1] 23.25025

The model summary shows that xGI has a significant effect on the classification of players as defenders or attackers. In fact, a unit increase of 1 xGI decreases the chance that a player is a defender by 23%.

Predicting Player Positions

To predict the position of players in 2024-25, I need to get the most recent data from Vaastav’s repository and prepare it in the same way as the training data.

# Download data
github <- "https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2024-25/"
filenames <- sprintf("gw%s",seq(1:10))
PlayersList <- lapply(filenames, function(x) {
  data <- subset(read_csv(url(paste0(github, "gws/",x,".csv"))), minutes>0)
  return(data)}) 
PlayerData25 <- data.frame(data.table::rbindlist(PlayersList))
# Replace FWD/MID with ATT
PlayerData25$position <- gsub('MID', 'ATT', PlayerData25$position)
PlayerData25$position <- gsub('FWD', 'ATT', PlayerData25$position)

# Select relevant columns
PlayerPos25 <- PlayerData25 %>% filter(minutes>60) %>% 
  filter(position!='GK') %>%
  select(name, position, expected_goal_involvements) %>% 
  group_by(name, position) %>% 
  summarise(total_xGI = sum(expected_goal_involvements))

Now I can apply the trained model to the test 2024-25 data, and predict the position of each player.

# Applying the model to predict probability of each position
PredProbs <- predict(logisticModel, PlayerPos25, type='response')

# Convert probability to category
PlayerPos25$PredPos <- ifelse(PredProbs >=0.5, 'DEF', 'ATT')

# Applying the model to predict probability of each position
PlayerPos$PredPos <- ifelse(logisticModel$fitted.values >=0.5, 'DEF', 'ATT')

# Create classification tables to see how well the model does at classifying players
ctabTrain <- table(PlayerPos$position, PlayerPos$PredPos)
ctabTest <- table(PlayerPos25$position, PlayerPos25$PredPos)
ctabTest
##      
##       ATT DEF
##   ATT  83 112
##   DEF  15 112

I can see that 173 attackers were identified correctly, while 48 defenders were identified correctly. From these numbers I can calculate the accuracy in training and test datasets.

# Accuracy in training dataset
accuracyTrain <- sum(diag(ctabTrain)/sum(ctabTrain))*100
accuracyTest <- sum(diag(ctabTest)/sum(ctabTest))*100
accuracyTrain
accuracyTest
## [1] 59.50226
## [1] 60.55901

The accuracy on both datasets is aaround ~60%. I could also plot a ROC curve, to see the relationship between the false positive and true positive rate.

# Calculate ROC
roc <- roc(PlayerPos25$position, PredProbs)
roc

# Plot ROC
ggroc(roc) + 
  theme_bw()+
  theme(panel.grid=element_blank(),
        text=element_text(family='Franklin Gothic Book', size=14)) +
  geom_abline(slope=1, intercept=1, linetype='dashed')

## 
## Call:
## roc.default(response = PlayerPos25$position, predictor = PredProbs)
## 
## Data: PredProbs in 195 controls (PlayerPos25$position ATT) < 127 cases (PlayerPos25$position DEF).
## Area under the curve: 0.685

The closer an AUC value is to 1, and the more a ROC plot increases into the top left corner, the better the logistic regression model. From the ROC curve above and an AUC (area under the curve) value of 0.685, we can see that the model is relatively good at predicting player position, but it could be improved. For example, we could add more predictors.

Prepare Data

This time, instead of just looking at total xGI, I want to add threat, creativity and element statistics for each player, to make a better prediction model for player position.

# 2023-24 data
PlayerPos <- PlayerData %>% filter(minutes>60) %>% 
  filter(position!='GK') %>%
  select(name, position, expected_goal_involvements, threat, creativity, element) %>% 
  group_by(name, position) %>% 
  summarise(total_xGI = sum(expected_goal_involvements),
            total_threat = sum(threat),
            total_creativity=sum(creativity),
            total_element=sum(element))

# 2024-25 data
PlayerPos25 <- PlayerData25 %>% filter(minutes>60) %>% 
  filter(position!='GK') %>%
  select(name, position, expected_goal_involvements, threat, creativity, element) %>% 
  group_by(name, position) %>% 
  summarise(total_xGI = sum(expected_goal_involvements),
            total_threat = sum(threat),
            total_creativity=sum(creativity),
            total_element=sum(element))

Logistic Regression

Now I can create a logistic regression model that involves xGI, threat, creativity and element statistics.

# Turn categorical data into factors
PlayerPos$position <- as.factor(PlayerPos$position)

# Logistic regression model of sex, age, heteroplasmy on binary phenotype
logisticModel <- glm(position ~ total_xGI+total_creativity+total_threat+total_element, family = binomial(), PlayerPos)

# Results of regression
summary(logisticModel)
## 
## Call:
## glm(formula = position ~ total_xGI + total_creativity + total_threat + 
##     total_element, family = binomial(), data = PlayerPos)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.017e-01  1.651e-01  -0.616   0.5379    
## total_xGI        -4.386e-01  1.774e-01  -2.473   0.0134 *  
## total_creativity  1.572e-03  1.168e-03   1.347   0.1780    
## total_threat     -1.709e-03  2.160e-03  -0.791   0.4289    
## total_element     1.550e-04  2.838e-05   5.461 4.73e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 593.45  on 441  degrees of freedom
## Residual deviance: 483.22  on 437  degrees of freedom
## AIC: 493.22
## 
## Number of Fisher Scoring iterations: 6

Using the model trained on 2023-24 data, we can predict the positions of players in the 2024-25 season.

# Applying the model to predict probability of each position
PredProbs <- predict(logisticModel, PlayerPos25, type='response')

# Convert probability to category
PlayerPos25$PredPos <- ifelse(PredProbs >=0.5, 'DEF', 'ATT')

# Applying the model to predict probability of each position
PlayerPos$PredPos <- ifelse(logisticModel$fitted.values >=0.5, 'DEF', 'ATT')

# Create classification tables
ctabTrain <- table(PlayerPos$position, PlayerPos$PredPos)
ctabTest <- table(PlayerPos25$position, PlayerPos25$PredPos)

Now we can calculate the accuracy in training and test datasets.

# Accuracy in training dataset
accuracyTrain <- sum(diag(ctabTrain)/sum(ctabTrain))*100
accuracyTest <- sum(diag(ctabTest)/sum(ctabTest))*100
accuracyTrain
accuracyTest
## [1] 72.85068
## [1] 68.63354

The accuracy on both datasets is around ~70%, which is higher than the previous model that just used xGI. I can also plot a ROC curve, to see the relationship between the false positive and true positive rate.

roc <- roc(PlayerPos25$position, PredProbs)
roc
ggroc(roc) + 
  theme_bw()+
  theme(panel.grid=element_blank(),
        text=element_text(family='Franklin Gothic Book', size=14)) +
  geom_abline(slope=1, intercept=1, linetype='dashed')

## 
## Call:
## roc.default(response = PlayerPos25$position, predictor = PredProbs)
## 
## Data: PredProbs in 195 controls (PlayerPos25$position ATT) < 127 cases (PlayerPos25$position DEF).
## Area under the curve: 0.7511

Here the AUC is 0.75 which is higher than the AUC for the previous model, meaning that when I also add creativity, threat and element, the model is better at predicting player position.

Attacking Defenders

I could also use the output of the test dataset predictions to identify defenders that play an attacking role in their team. To do this, I could get the defenders that have the highest probability of being assigned an attacker using the logistic regression model.

# Add probabilities to dataframe
PlayerPos25$PredProbability <- PredProbs

# Transform so defenders are inverse
PlayerPos25$PredProbabilityT <- ifelse(PlayerPos25$position=='ATT', 1-PlayerPos25$PredProbability, PlayerPos25$PredProbability)

# Get top 5 most attacking players
PlayerPos25 %>% as.data.frame() %>% 
  arrange(-PredProbabilityT) %>% head(n=5) %>% 
  select(name, PredProbabilityT)
##                       name PredProbabilityT
## 1           Erling Haaland        0.9930109
## 2            Mohamed Salah        0.9657008
## 3              Kai Havertz        0.9512516
## 4          Nicolas Jackson        0.9499096
## 5 Dominic Solanke-Mitchell        0.9464745
# Get top 5 most attacking defenders
PlayerPos25 %>% as.data.frame() %>% 
  filter(position=='DEF') %>% arrange(PredProbabilityT) %>% 
  head(n=5) %>% select(name, PredProbabilityT)
##                           name PredProbabilityT
## 1               Nathan Collins        0.2513318
## 2 Gabriel dos Santos Magalhães        0.2548458
## 3       Trent Alexander-Arnold        0.3325852
## 4                Ethan Pinnock        0.3356420
## 5                  Lucas Digne        0.3450896

Below is a plot summarising the results of my analysis.