Statistics | UMAP

UMAP (Uniform Manifold Approximation and Projection) is a way to translate high-dimensional data into a lower-dimensional representation. It starts by creating a topological graph: each point gets a local radius determined by the distance to its nth nearest neighbour, and that radius is made ‘fuzzy’, so the likelihood of a connection falls as the distance from the point grows.

The result is a weighted graph, with edge weights representing the likelihood of connection. From this high-dimensional weighted graph, UMAP then creates a lower-dimensional version, which it optimises to be as structurally similar to the high-dimensional graph as possible. Because UMAP forces each point to be connected to at least one neighbour (its nearest), it represents local structure as well as global structure. However, this means that the ‘distance’ between points in a UMAP plot shouldn’t be read as a literal measure of how different they are.
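
To make the fuzzy graph idea concrete, here is a minimal sketch in base R on made-up data. It is a simplification rather than what the umap package does internally: the real algorithm tunes each point's scale with a search, whereas here the scale (sigma) is simply assumed to be the distance to the k-th neighbour. All names (X, k, rho, sigma, W, G) are illustrative.

```r
# Toy data: five points in three dimensions (made-up values for illustration)
set.seed(1)
X <- matrix(rnorm(15), nrow = 5)
D <- as.matrix(dist(X))  # pairwise Euclidean distances
k <- 2                   # neighbours kept per point (assumed value)

W <- matrix(0, nrow(X), nrow(X))
for (i in seq_len(nrow(X))) {
  nn    <- order(D[i, ])[2:(k + 1)]  # k nearest neighbours, skipping the point itself
  rho   <- D[i, nn[1]]               # distance to the nearest neighbour
  sigma <- D[i, nn[k]]               # local scale: distance to the k-th neighbour (simplification)
  # Weight is 1 for the nearest neighbour and decays with extra distance
  W[i, nn] <- exp(-(D[i, nn] - rho) / sigma)
}

# Fuzzy union, so that each pair of points ends up with a single edge weight
G <- W + t(W) - W * t(W)
round(G, 2)
```

Each row of W has a weight of exactly 1 for the nearest neighbour, which is the ‘connected to at least one neighbour’ property described above.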

Utility

I really like using UMAP paired with a clustering method like DBSCAN to investigate patterns in high-dimensional data. Here I want to look at whether there are ‘clusters’ of player profiles in the Premier League, which is useful for seeing which players are similar to each other.

Summarising Data

First I am going to read in some Premier League data from this season!

# Packages
library(tidyverse)
library(extrafont)
library(ggrepel)
library(dplyr)
library(umap)
library(dbscan)

# Read data
Path <- '/Users/alicesmail/Desktop/Programming/GitHubPage/FPL/2024-2025-Data/'
PlayerData <- read_csv(paste0(Path, "FPL-Gameweeks-29.csv"))
head(as.data.frame(PlayerData[6:11]))

##            second_name team team_code team_name team_short_name     web_name
## 1      Ferreira Vieira    1         3   Arsenal             ARS Fábio Vieira
## 2    Fernando de Jesus    1         3   Arsenal             ARS      G.Jesus
## 3 dos Santos Magalhães    1         3   Arsenal             ARS      Gabriel
## 4              Havertz    1         3   Arsenal             ARS      Havertz
## 5                 Hein    1         3   Arsenal             ARS         Hein
## 6               Timber    1         3   Arsenal             ARS     J.Timber

Here I am just using dplyr functions to summarise different metrics for each player across the season so far:

# Summarise data
PlayerDataSum <- PlayerData %>% 
  group_by(web_name, team_short_name, position) %>% 
  summarise(goals_scored=sum(goals_scored), assists=sum(assists), 
            creativity=sum(creativity), xA=sum(expected_assists), 
            xG=sum(expected_goals), influence=sum(influence), 
            threat=sum(threat), minutes=sum(minutes), 
            .groups='drop') %>% 
  # Keep players with more than ten full matches' worth of minutes
  filter(minutes>90*10)

# Get statistics per 90 minutes
PlayerDataSum <- PlayerDataSum %>% 
  group_by(web_name, team_short_name, position) %>% 
  summarise(creat90=round((creativity/minutes)*90,2),
            influ90=round((influence/minutes)*90,2),
            thr90=round((threat/minutes)*90,2),
            ast90=round((assists/minutes)*90,2),
            gls90=round((goals_scored/minutes)*90,2),
            xA90=round((xA/minutes)*90,2),
            xG90=round((xG/minutes)*90,2)) %>%
  rename('name'='web_name',
         'team'='team_short_name')
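
As a quick check of the per-90 arithmetic, here is a small worked example in base R. The two players and their totals are hypothetical, not taken from the FPL file.

```r
# Hypothetical season totals (not from the FPL data)
totals <- data.frame(name = c("Player A", "Player B"),
                     goals_scored = c(10, 6),
                     minutes = c(1800, 1350))

# Goals per 90 minutes: season total divided by minutes played, times 90
totals$gls90 <- round(totals$goals_scored / totals$minutes * 90, 2)
totals
```

Player A has played 20 full matches' worth of minutes, so 10 goals works out at 0.5 per 90. Player B has played 15, so 6 goals works out at 0.4 per 90.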

Now I have got a few different per-90 metrics for each player:

# View
PlayerDataSum %>% arrange(-xG90) %>% as.data.frame() %>% head()

##        name team position creat90 influ90 thr90 ast90 gls90 xA90 xG90
## 1   Haaland  MCI      FWD   11.74   32.89 51.10  0.11  0.76 0.06 0.75
## 2   M.Salah  LIV      MID   32.33   50.31 58.45  0.60  0.95 0.24 0.75
## 3      Isak  NEW      FWD   20.67   38.25 42.79  0.22  0.83 0.14 0.68
## 4   Watkins  AVL      FWD   12.66   27.02 42.02  0.26  0.57 0.07 0.60
## 5     Wissa  BRE      FWD   12.04   25.57 35.56  0.13  0.55 0.06 0.58
## 6 N.Jackson  CHE      FWD   16.53   26.03 44.30  0.31  0.47 0.09 0.56

Next I just want to make sure each row has a unique row name, as some players share the same web name.

# Unique rownames
PlayerDataSum <- PlayerDataSum %>% as.data.frame()
row.names(PlayerDataSum) <- make.unique(PlayerDataSum$name)
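
For reference, this is how make.unique() handles duplicates, shown on a made-up vector of names:

```r
# Made-up example: two players sharing a web name
web_names <- c("Johnson", "Rice", "Johnson")
unique_names <- make.unique(web_names)
unique_names
# "Johnson" "Rice" "Johnson.1"
```

The first occurrence keeps its name and later duplicates get a numeric suffix, so the row names stay readable.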

Creating the UMAP

With this input data I can select the columns I want to feed into UMAP and run it!

# Perform UMAP
set.seed(1)
PlayerDataUMAP <- PlayerDataSum %>% 
  select(-c(name, position, team)) %>% 
  select(where(is.numeric)) %>%
  scale() %>%
  umap(preserve.seed=TRUE)

# Add info
PlayerDataUMAPPlot <- PlayerDataUMAP$layout %>% as.data.frame() %>% 
  rename(UMAP1='V1', UMAP2='V2') %>% as.data.frame()
PlayerDataUMAPPlotAll <- merge(PlayerDataUMAPPlot %>% rownames_to_column('ID'), 
                               PlayerDataSum %>% rownames_to_column('ID'))
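
The scale() step in the pipeline above matters because the metrics sit on very different ranges: threat per 90 is in the tens, while xG per 90 is below 1. A short sketch of what it does, using the thr90 and xG90 values for the first three players in the table shown earlier:

```r
# thr90 and xG90 for Haaland, M.Salah and Isak, copied from the table above
stats <- data.frame(thr90 = c(51.10, 58.45, 42.79),
                    xG90  = c(0.75, 0.75, 0.68))

# scale() centres each column on 0 and divides by its standard deviation
scaled <- scale(stats)
round(colMeans(scaled), 10)  # both column means are 0
apply(scaled, 2, sd)         # both standard deviations are 1
```

Without this step, the distances that UMAP works from would presumably be dominated by the metrics with the largest raw values.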

To visualise the results I can just plot the UMAP coordinates on a scatter plot.

# Palette
palette <- list(colorRampPalette(colors=c('#ba5346', '#cfc963', '#75a450', '#90bdcf'))(4))

# Plot
ggplot(PlayerDataUMAPPlotAll, aes(x=UMAP1, y=UMAP2, colour=position))+
  geom_point(size=4, alpha=0.75)+
  theme_classic()+
  theme(text=element_text(family='Roboto',size=14))+
  scale_colour_manual(values=palette[[1]])+
  labs(colour='Position')

I can see that most midfielders and forwards group together, with some defenders more similar to goalkeepers and others more similar to midfielders.

DBSCAN

Next I can apply DBSCAN clustering to the UMAP coordinates to identify similar players.

# DBSCAN
DB <- dbscan(PlayerDataUMAPPlotAll %>% select(UMAP1, UMAP2), eps=0.4, minPts=5)
PlayerDataUMAPPlotAll$Cluster <- DB$cluster
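
To show what eps and minPts control, here is a minimal base R sketch of the ‘core point’ test that DBSCAN is built on. The coordinates are made up, and this only covers the first step of the algorithm, not the full clustering.

```r
# Made-up 2D coordinates: a tight group of five points plus one outlier
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(0.05, 0.05), c(3, 3))
eps    <- 0.4  # neighbourhood radius
minPts <- 5    # minimum neighbourhood size, counting the point itself

D <- as.matrix(dist(pts))
n_within_eps <- rowSums(D <= eps)   # includes the point itself (distance 0)
is_core <- n_within_eps >= minPts
is_core
```

The five grouped points are all core points, while the outlier is not. In the dbscan package, points that are not reachable from any core point are labelled as cluster 0, which is the noise group.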

# Palette (grey is prepended in the plot for cluster 0, which DBSCAN uses for noise points)
palette <- list(colorRampPalette(colors=c('#ba5346', '#cfc963', '#75a450', '#90bdcf'))(max(PlayerDataUMAPPlotAll$Cluster)+1))

# Plot
ggplot(PlayerDataUMAPPlotAll, aes(x=UMAP1, y=UMAP2, colour=as.factor(Cluster)))+
  geom_point(size=4, alpha=0.75)+
  theme_classic()+
  theme(text=element_text(family='Roboto',size=14))+
  scale_colour_manual(values=c('#ebebeb', palette[[1]]))+
  labs(colour='Group')+
  guides(colour='none')

Similar Players

I can now inspect some of the groups! Here, group 6 is made up of central/defensive midfielders like Anderson, Kovačić and Rice, as well as more attacking defenders, like Hall, Robinson, Pedro Porro and Trent Alexander-Arnold.

# Group 6
PlayerDataUMAPPlotAll %>% filter(Cluster==6) %>% 
  select(name, position, team, creat90, influ90, thr90, ast90, gls90, xA90, xG90) %>%
  arrange(-influ90) %>% head(n=10)

##                name position team creat90 influ90 thr90 ast90 gls90 xA90 xG90
## 1  Alexander-Arnold      DEF  LIV   33.27   27.75  9.74  0.25  0.08 0.27 0.06
## 2          Robinson      DEF  FUL   21.57   26.87  6.31  0.35  0.00 0.12 0.02
## 3       Pedro Porro      DEF  TOT   29.42   26.60  7.39  0.20  0.08 0.14 0.06
## 4           Kovačić      MID  MCI   24.16   23.42  8.05  0.11  0.21 0.14 0.08
## 5              Hall      DEF  NEW   23.08   23.10  5.27  0.29  0.00 0.15 0.02
## 6         Tielemans      MID  AVL   29.41   22.75  7.95  0.14  0.07 0.18 0.09
## 7          Anderson      MID  NFO   23.30   22.20 10.06  0.27  0.05 0.14 0.06
## 8          Christie      MID  BOU   23.51   21.88 12.69  0.13  0.09 0.12 0.09
## 9              Rice      MID  ARS   31.82   21.79 10.86  0.21  0.08 0.21 0.08
## 10       Bellegarde      MID  WOL   23.35   20.94 12.18  0.42  0.14 0.17 0.08

Meanwhile, group 2 is made up of creative midfielders, especially wingers like Elanga, Jacob Murphy, Amad and Saka, who have high xA and assists.

# Group 2
PlayerDataUMAPPlotAll %>% filter(Cluster==2) %>% 
  select(name, position, team, creat90, influ90, thr90, ast90, gls90, xA90, xG90) %>%
  arrange(-ast90) %>% head(n=8)

##          name position team creat90 influ90 thr90 ast90 gls90 xA90 xG90
## 1        Saka      MID  ARS   44.72   36.64 42.87  0.78  0.35 0.40 0.30
## 2     Savinho      MID  MCI   33.37   20.92 34.42  0.58  0.06 0.31 0.27
## 3         Son      MID  TOT   34.42   29.93 30.71  0.49  0.34 0.23 0.32
## 4 Gibbs-White      MID  NFO   27.18   23.62 22.11  0.45  0.22 0.18 0.18
## 5    J.Murphy      MID  NEW   21.44   24.64 19.44  0.43  0.27 0.20 0.21
## 6 B.Fernandes      MID  MUN   38.69   32.48 18.34  0.42  0.30 0.21 0.32
## 7        Amad      MID  MUN   31.24   29.23 26.98  0.40  0.34 0.19 0.23
## 8      Elanga      MID  NFO   28.65   23.57 19.80  0.39  0.24 0.18 0.19

I can also use this information to identify players with similar playing profiles. Looking at defenders, Kerkez has similar statistics to Aït-Nouri, Wan-Bissaka, Ashley Young and Muñoz. These players have quite high creativity for defenders, with relatively high goals and assists per 90 compared to their xG and xA.

# Which defenders have similar statistics to Kerkez?
PlayerDataUMAPPlotAll %>% 
  filter(Cluster==PlayerDataUMAPPlotAll[grep("Kerkez", PlayerDataUMAPPlotAll$name),]$Cluster & position=='DEF') %>% 
  select(name, position, team, creat90, influ90, thr90, ast90, gls90, xA90, xG90)

##          name position team creat90 influ90 thr90 ast90 gls90 xA90 xG90
## 1   Aït-Nouri      DEF  WOL   14.83   21.69 10.65  0.19  0.11 0.07 0.07
## 2      Kerkez      DEF  BOU   15.88   20.14  6.95  0.21  0.07 0.07 0.02
## 3       Muñoz      DEF  CRY   17.52   22.96 15.24  0.19  0.12 0.09 0.16
## 4 Wan-Bissaka      DEF  WHU   14.60   21.87  5.18  0.11  0.07 0.08 0.04
## 5       Young      DEF  EVE   16.20   21.09  2.67  0.23  0.06 0.11 0.02
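
A cluster only says who is in the same group, not who is closest within it. One possible complement, which is not used in this post, is to rank players by Euclidean distance on the standardised metrics. The helper nearest_players() below is hypothetical, and the sketch uses only two of the metrics for the five defenders in the table above.

```r
# Hypothetical helper: rank players by distance to a target on standardised metrics
nearest_players <- function(stats, target, n = 3) {
  Z <- scale(stats)                                # standardise each metric
  d <- sqrt(rowSums(sweep(Z, 2, Z[target, ])^2))   # Euclidean distance to the target
  head(sort(d[names(d) != target]), n)             # closest first, target excluded
}

# creat90 and thr90 for the five defenders, copied from the table above
stats <- data.frame(creat90 = c(14.83, 15.88, 17.52, 14.60, 16.20),
                    thr90   = c(10.65, 6.95, 15.24, 5.18, 2.67),
                    row.names = c("Aït-Nouri", "Kerkez", "Muñoz", "Wan-Bissaka", "Young"))

similar <- nearest_players(stats, "Kerkez")
similar
```

Ranking on the original metrics rather than on the UMAP coordinates avoids leaning on distances in the UMAP plot, which are hard to interpret.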

Finally, Chris Wood is most similar to several other attackers, including Haaland, Mateta and Wissa. These players all have very high xG and threat, but relatively low xA and creativity (with the exception of Salah!).

# Which strikers have similar statistics to Chris Wood?
PlayerDataUMAPPlotAll %>% 
  filter(Cluster==PlayerDataUMAPPlotAll[grep("Wood", PlayerDataUMAPPlotAll$name),]$Cluster) %>% 
  select(name, position, team, creat90, influ90, thr90, ast90, gls90, xA90, xG90) %>%
  arrange(-gls90) %>% head(n=8)

##      name position team creat90 influ90 thr90 ast90 gls90 xA90 xG90
## 1 M.Salah      MID  LIV   32.33   50.31 58.45  0.60  0.95 0.24 0.75
## 2    Isak      FWD  NEW   20.67   38.25 42.79  0.22  0.83 0.14 0.68
## 3 Haaland      FWD  MCI   11.74   32.89 51.10  0.11  0.76 0.06 0.75
## 4    Wood      FWD  NFO    8.66   28.58 29.27  0.12  0.69 0.04 0.42
## 5    Beto      FWD  EVE    8.61   27.18 43.73  0.00  0.59 0.02 0.52
## 6 Watkins      FWD  AVL   12.66   27.02 42.02  0.26  0.57 0.07 0.60
## 7   Wissa      FWD  BRE   12.04   25.57 35.56  0.13  0.55 0.06 0.58
## 8  Mateta      FWD  CRY   15.09   24.04 26.55  0.08  0.51 0.09 0.48

Summary

Here I have applied UMAP and DBSCAN to identify collections of Premier League players with similar statistics.