Statistics | Logistic Regression

Logistic regression can be applied to a categorical outcome that is associated with one or more predictors, which may be categorical or continuous variables. A logistic regression model estimates the effect of each predictor on the probability that a data point belongs to a specific category. The fitted model can then be used to predict which category each data point in a set of unlabeled data will fall into.

As an example, metrics such as xG or creativity could be used to predict which position a Premier League player is classified as.

Opening Packages & Downloading Data

Before trying to apply a logistic regression model, I need to load the packages I want to use (ggplot2 & dplyr, which are part of the tidyverse package), and download some data from vaastav’s Fantasy-Premier-League GitHub page.

The data I will use as a training set is from the 2023-24 season. I can then use the logistic regression model trained on this data to predict the position of players in the 2024-25 season using their xGI (expected goal involvements) data.
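A minimal sketch of this setup step is shown here. The file layout (data/<season>/gws/merged_gw.csv on the master branch) and the helper names gw_url() and load_season() are assumptions for illustration, not taken from the original code, so check them against the repository before relying on them.

```r
# Packages used in this post
library(ggplot2)
library(dplyr)

# Build the raw-file URL for one season.
# ASSUMPTION: the repository stores one merged gameweek file per season at this path.
gw_url <- function(season) {
  paste0("https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/",
         "master/data/", season, "/gws/merged_gw.csv")
}

# Wrap the download in a function so nothing is fetched until it is called
load_season <- function(season) {
  read.csv(gw_url(season), stringsAsFactors = FALSE, encoding = "UTF-8")
}

# Example use (needs an internet connection):
# Data24 <- load_season("2023-24")
```

Keeping the season as an argument means the same helper can later be reused for the 2024-25 test data.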

Preparing the Data

Here I create two groups of players: attackers (MID & FWD position) and defenders (DEF). Then I filter out GKs, and summarise the total xGI for each player across the entire season.
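The preparation step could look roughly like the sketch below. It runs on a small made-up data frame rather than the real download, and the column names (name, position, expected_goal_involvements) are assumed to match the gameweek file. The output names PlayerPos and total_xGI are the ones used in the plotting code.

```r
library(dplyr)

# Toy stand-in for the gameweek data: two gameweeks for four hypothetical players
gw <- data.frame(
  name = c("Player A", "Player A", "Player B", "Player B",
           "Player C", "Player C", "Player D", "Player D"),
  position = c("FWD", "FWD", "DEF", "DEF", "GK", "GK", "MID", "MID"),
  expected_goal_involvements = c(0.8, 0.4, 0.1, 0.0, 0.0, 0.0, 0.3, 0.5)
)

PlayerPos <- gw %>%
  # Keep outfield players only (drops GKs)
  filter(position %in% c("DEF", "MID", "FWD")) %>%
  # Collapse MID and FWD into a single attacker group
  mutate(position = ifelse(position == "DEF", "DEF", "ATT")) %>%
  # One row per player with their season total xGI
  group_by(name, position) %>%
  summarise(total_xGI = sum(expected_goal_involvements), .groups = "drop")

print(PlayerPos)
```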

The plot below shows the xGI and position for each player: in general, attackers have a higher xGI, as expected.

# Plot player data
ggplot(PlayerPos, aes(x=total_xGI, y=position, colour=position, fill=position))+
  geom_violin(alpha=0.5, size=1) +
  geom_jitter(alpha=0.5, size=2, height=0.2) +
  theme_classic() +
  scale_colour_manual(values=c('#1c7ad9', '#d91c3c'))+
  scale_fill_manual(values=c('#1c7ad9', '#d91c3c'))+
  scale_size_continuous(range=c(5,15)) +
  theme(text=element_text(family='Franklin Gothic Book', size=14),
        panel.grid=element_blank())+
  guides(colour='none', fill='none') +
  labs(x='Total xGI', y='Position')

Logistic Regression

Next I can factorise the data labels (player position), and create a logistic regression model of position & xGI.
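As a rough sketch of this step, the code below fits a one-predictor logistic regression with glm() on simulated data. The simulated xGI values are invented purely so the example runs on its own, and the factor level order (ATT first, DEF second) is an assumption: with that order, glm() models the probability of being a defender.

```r
set.seed(1)

# Simulated stand-in for the prepared data: attackers tend to have higher xGI
PlayerPos <- data.frame(
  position  = rep(c("ATT", "DEF"), each = 100),
  total_xGI = c(rexp(100, rate = 1 / 6), rexp(100, rate = 1 / 2))
)

# Factorise the labels; the second level (DEF) is treated as the "success" class
PlayerPos$position <- factor(PlayerPos$position, levels = c("ATT", "DEF"))

# Logistic regression of position on total xGI
Model <- glm(position ~ total_xGI, data = PlayerPos, family = binomial)
summary(Model)
```

The coefficient table in the summary reports the estimate for total_xGI on the log-odds scale, along with its standard error and p-value.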

The model summary shows that xGI has a significant effect on the classification of players as defenders or attackers. In fact, each additional unit of xGI decreases the odds that a player is a defender by 23%.
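Because a logistic regression coefficient acts on the log-odds, a figure like the 23% is most naturally read as a change in odds rather than in probability. The sketch below assumes the 23% corresponds to an odds ratio of 0.77 (the fitted coefficient itself is not shown in this post, so this is an assumption) and shows how the same odds ratio shifts the probability by different amounts depending on the starting point.

```r
# ASSUMPTION: a 23% decrease in odds per unit of xGI, i.e. an odds ratio of 0.77
odds_ratio <- 0.77

# The corresponding coefficient on the log-odds scale
beta <- log(odds_ratio)

# Percentage change in odds per unit increase, recovered from the coefficient
pct_change_odds <- (exp(beta) - 1) * 100

# Effect of one extra unit of xGI on the probability of being a defender,
# starting from three different baseline probabilities
p_before    <- c(0.2, 0.5, 0.8)
odds_before <- p_before / (1 - p_before)
odds_after  <- odds_before * odds_ratio
p_after     <- odds_after / (1 + odds_after)

data.frame(p_before, p_after, change = p_after - p_before)
```

The change in probability is largest near 0.5 and smaller near 0 or 1, which is why the effect is usually reported as an odds ratio.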

Predicting Player Positions

To predict the position of players in 2024-25, I need to get the most recent data from vaastav’s repository and prepare it in the same way as the training data.

Now I can apply the trained model to the 2024-25 test data and predict the position of each player.
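A minimal sketch of the prediction step, again on simulated data so it runs on its own. The helper simulate_players(), the column names PredProbDEF and PredPosition, and the 0.5 cut-off are all assumptions for illustration.

```r
set.seed(42)

# Hypothetical helper that simulates a season of prepared player data
simulate_players <- function(n_per_group) {
  data.frame(
    position  = factor(rep(c("ATT", "DEF"), each = n_per_group),
                       levels = c("ATT", "DEF")),
    total_xGI = c(rexp(n_per_group, rate = 1 / 6),
                  rexp(n_per_group, rate = 1 / 2))
  )
}

PlayerPos   <- simulate_players(100)  # stands in for the 2023-24 training data
PlayerPos25 <- simulate_players(50)   # stands in for the 2024-25 test data

# Train on one season
Model <- glm(position ~ total_xGI, data = PlayerPos, family = binomial)

# type = "response" returns probabilities (here: probability of being a DEF)
PlayerPos25$PredProbDEF <- predict(Model, newdata = PlayerPos25, type = "response")

# Convert probabilities to labels with an assumed 0.5 threshold
PlayerPos25$PredPosition <- ifelse(PlayerPos25$PredProbDEF > 0.5, "DEF", "ATT")

head(PlayerPos25)
```

The test data must contain the same predictor columns, with the same names, as the data the model was trained on.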

I can see that 173 attackers were identified correctly, while 48 defenders were identified correctly. From these numbers I can calculate the accuracy in training and test datasets.
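Counts like these come from a confusion matrix, and accuracy is the share of players on its diagonal. The sketch below uses eight made-up labels rather than the real predictions, so the numbers it prints are not the ones reported in this post.

```r
# Made-up true and predicted labels for eight players
actual <- factor(c("ATT", "ATT", "ATT", "DEF", "DEF", "ATT", "DEF", "DEF"),
                 levels = c("ATT", "DEF"))
predicted <- factor(c("ATT", "ATT", "DEF", "DEF", "ATT", "ATT", "DEF", "ATT"),
                    levels = c("ATT", "DEF"))

# Rows are the true positions, columns are the predicted positions
ConfMat <- table(Actual = actual, Predicted = predicted)
print(ConfMat)

# Accuracy = correctly classified players / all players
accuracy <- sum(diag(ConfMat)) / sum(ConfMat)
accuracy
```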

The accuracy on both datasets is around 60%. I could also plot a ROC curve to see the relationship between the false positive rate and the true positive rate.

The closer the AUC (area under the curve) is to 1, and the closer the ROC curve gets to the top left corner, the better the logistic regression model separates the two classes; an AUC of 0.5 is no better than random guessing. From the ROC curve and an AUC value of 0.685, we can see that the model is moderately good at predicting player position, but it could be improved. For example, we could add more predictors.
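To make the ROC and AUC calculation concrete, here is a small base R sketch that computes both by hand. It is not necessarily how the curve in this post was produced (a dedicated package such as pROC is a common alternative); the helper names roc_points() and auc_trapezoid() and the eight probabilities are made up for illustration.

```r
# True and false positive rates at every possible threshold
roc_points <- function(prob, is_positive) {
  thresholds <- sort(unique(c(Inf, prob, -Inf)), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(prob[is_positive] >= t))
  fpr <- sapply(thresholds, function(t) mean(prob[!is_positive] >= t))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Area under the ROC curve using the trapezoid rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Made-up predicted probabilities and true classes for eight players
prob        <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

roc <- roc_points(prob, is_positive)
auc <- auc_trapezoid(roc)
auc

# Quick look at the curve; the diagonal marks random guessing
plot(roc$fpr, roc$tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)
```

The AUC equals the share of positive-negative pairs in which the positive case receives the higher predicted probability, which is 13 of 16 pairs for this toy data.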

Prepare Data

This time, instead of just looking at total xGI, I want to add threat, creativity and element statistics for each player, to make a better prediction model for player position.

Logistic Regression

Now I can create a logistic regression model that involves xGI, threat, creativity and element statistics.
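Adding predictors only changes the model formula. The sketch below fits a one-predictor and a multi-predictor model on simulated data and compares them. The column names total_threat and total_creativity and all simulated values are assumptions; element is left out here because the post does not describe how it was summarised per player, but it could be added to the formula in the same way.

```r
set.seed(7)
n <- 300
is_def <- rep(c(FALSE, TRUE), each = n / 2)

# Simulated season totals: defenders get lower values on average
PlayerPos <- data.frame(
  position         = factor(ifelse(is_def, "DEF", "ATT"), levels = c("ATT", "DEF")),
  total_xGI        = rexp(n, rate = ifelse(is_def, 1 / 2, 1 / 6)),
  total_threat     = rexp(n, rate = ifelse(is_def, 1 / 100, 1 / 300)),
  total_creativity = rexp(n, rate = ifelse(is_def, 1 / 100, 1 / 250))
)

# Single-predictor model, as before
Model1 <- glm(position ~ total_xGI, data = PlayerPos, family = binomial)

# Multi-predictor model: extra terms are added with +
Model2 <- glm(position ~ total_xGI + total_threat + total_creativity,
              data = PlayerPos, family = binomial)

summary(Model2)

# Lower AIC suggests a better trade-off between fit and complexity
AIC(Model1, Model2)
```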

Using the model trained on 2023-24 data, we can predict the positions of players in the 2024-25 season.

The accuracy on both datasets is around 70%, which is higher than the previous model that just used xGI. I can also plot a ROC curve to see the relationship between the false positive rate and the true positive rate.

Here the AUC is 0.75, which is higher than the AUC for the previous model. This means that adding creativity, threat and element makes the model better at predicting player position.

Attacking Defenders

I could also use the output of the test dataset predictions to identify defenders that play an attacking role in their team. To do this, I could get the defenders that have the highest probability of being classified as an attacker by the logistic regression model.
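One way to pull out these players is to keep only the true defenders and sort them by predicted attacker probability. The sketch below uses hypothetical players and made-up probabilities, and it assumes that the PredProbabilityT column used in the plot holds the predicted probability of being an attacker.

```r
# Hypothetical test-set predictions
PlayerPos25 <- data.frame(
  name             = c("Defender A", "Defender B", "Defender C",
                       "Attacker A", "Attacker B"),
  position         = c("DEF", "DEF", "DEF", "ATT", "ATT"),
  PredProbabilityT = c(0.35, 0.72, 0.58, 0.91, 0.66)
)

# Keep the true defenders only
Defenders <- PlayerPos25[PlayerPos25$position == "DEF", ]

# Sort from most to least attacker-like
AttackingDefenders <- Defenders[order(Defenders$PredProbabilityT, decreasing = TRUE), ]

# The top rows are the defenders the model is most inclined to call attackers
head(AttackingDefenders, 2)
```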

# Add labels
PlayerPos25$label1 <- ifelse(PlayerPos25$name %in% c('Nathan Collins','Lucas Digne'), PlayerPos25$name, NA)
PlayerPos25$label2 <- ifelse(PlayerPos25$name %in% c('Gabriel dos Santos Magalhães','Ethan Pinnock'), PlayerPos25$name, NA)
PlayerPos25$label3 <- ifelse(PlayerPos25$name %in% c('Nicolas Jackson'), PlayerPos25$name, NA)
PlayerPos25$label4 <- ifelse(PlayerPos25$name %in% c('Trent Alexander-Arnold', 'Kai Havertz'), PlayerPos25$name, NA)
PlayerPos25$label5 <- ifelse(PlayerPos25$name %in% c('Erling Haaland'), PlayerPos25$name, NA)
PlayerPos25$label6 <- ifelse(PlayerPos25$name %in% c('Mohamed Salah'), PlayerPos25$name, NA)

# Edit labels 
PlayerPos25$label2 <- gsub('Gabriel dos Santos Magalhães', 'Gabriel', PlayerPos25$label2)

# Plot DEF vs ATT (geom_text_repel() comes from the ggrepel package)
library(ggrepel)
ggplot(PlayerPos25, aes(x=PredProbabilityT, y=1, colour=position))+
  
  # Points
  geom_point(size=5, alpha=0.6)+ 
  
  # Theme
  theme_classic()+
  theme(panel.grid=element_blank(), axis.line.y=element_blank(), 
        axis.text.y=element_blank(), axis.ticks.y=element_blank(),
        text=element_text(family='Franklin Gothic Book', size=12),
        plot.title = element_text(hjust = 0.5)) +
  
  # Add labels and re-position
  geom_text_repel(aes(label=label4), family='Franklin Gothic Book',
                  size=3, colour='black', box.padding=0, nudge_y=0.05, nudge_x=-0.035, segment.curvature=1)+
  geom_text_repel(aes(label=label1), family='Franklin Gothic Book',
                  size=3, colour='black', box.padding=0, nudge_y=0.035)+
  geom_text_repel(aes(label=label2), family='Franklin Gothic Book',
                  size=3, colour='black', box.padding=0, nudge_y=-0.035)+
  geom_text_repel(aes(label=label3), family='Franklin Gothic Book',
                  size=3, colour='black', box.padding=0, nudge_y=-0.045)+
  geom_text_repel(aes(label=label5), family='Franklin Gothic Book',
                  size=3, colour='black', box.padding=0, nudge_y=0.05,nudge_x=0.035, segment.curvature=-1)+
  geom_text_repel(aes(label=label6), family='Franklin Gothic Book',
                  size=3, colour='black', box.padding=0, nudge_y=0.035, segment.curvature=-1)+
  
  # Colour and labels 
  scale_colour_manual(values=c('#1c7ad9', '#d91c3c'))+
  guides(colour='none') +
  labs(x='Predicted Probability of Being an Attacker', y=NULL, title='Classifying Attackers & Defenders from GW1-10') +

  # Axis limits  
  scale_y_continuous(limits=c(0.95,1.05))+
  scale_x_continuous(limits=c(0.2,1.05)) +
  
  # Annotations
  geom_vline(xintercept=0.6, linetype='dashed') +
  annotate("segment", x=0.62, y=0.96, xend=0.68, yend=0.96, arrow = arrow(length = unit(0.1, "cm")), colour='#1c7ad9') +
  annotate("text", x=0.73, y=0.96, label='Attacking attackers', family='Franklin Gothic Book', size=3, colour='#1c7ad9') +
  annotate("segment", x=0.58, y=1.04, xend=0.52, yend=1.04, arrow = arrow(length = unit(0.1, "cm")), colour='#d91c3c') +
  annotate("text", x=0.47, y=1.04, label='Attacking defenders ', family='Franklin Gothic Book', size=3, colour='#d91c3c')