It is no surprise that football is one of the world’s most popular sports: at the highest level, 22 players go head-to-head for possession of the ball in front of millions of eyes.
In reality, though, that’s not the whole story when it comes to watching a football game, and if we examine how much data we’re able to glean from a single match, we might just get a hint as to why.
This article is about one sub-problem of football analysis, and lets me share my own experience in extracting as much knowledge as possible from broadcast-like video feeds of football matches.
The Problem
The problem isn’t framed in an easy way: extracting positional and semantic information from a single moving camera is hard. Placing multiple fixed cameras around the field would simplify things considerably, but in a real stadium you probably wouldn’t be permitted to do that, given the obvious budget and permission constraints.
Still, there are ways to process this video data (at least roughly) on a budget, and without having to leave your familiar chair.
The Approach
We approached the task like any textbook (good) software engineer would: we decomposed the problem into smaller, more specific, and more manageable ones.
The division we came up with was as follows:
- Estimating the players’ positions from the camera view and projecting them onto a 2D plane of the pitch (field referencing and homography estimation).
- Object detection (what and where are the players, the ball, and the referee).
- Tracking entities between frames (object tracking).
- Re-identifying individual players across frames (player identification).
- Assigning each player to a team (team classification).
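The first sub-problem, mapping frame coordinates onto the pitch, boils down to homography estimation. Here is a minimal sketch in pure NumPy using the classic DLT (direct linear transform) method; the point correspondences are illustrative, and in practice they would come from detected pitch landmarks such as line intersections:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 homography mapping src -> dst via DLT.
    src, dst: (N, 2) arrays of corresponding points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two rows to the DLT system.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A (last row of V^T).
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, points):
    """Apply homography H to (N, 2) frame points -> pitch coordinates."""
    pts = np.hstack([points, np.ones((len(points), 1))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Toy correspondences: four frame corners mapped to a 105x68 m pitch.
src = np.array([[0, 0], [100, 0], [100, 50], [0, 50]], dtype=float)
dst = np.array([[0, 0], [105, 0], [105, 68], [0, 68]], dtype=float)
H = estimate_homography(src, dst)
center = project(H, np.array([[50.0, 25.0]]))  # frame midpoint -> pitch center
```

With noisy real-world correspondences you would use a robust estimator (e.g. RANSAC) rather than a single SVD solve, but the projection step stays the same.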
Starting with the overarching architecture of the system, we can then examine the details of the various tasks, from positional to semantic.
The system receives a sequence of frames as input and processes them sequentially, running object detection (on the field and the entities) for each one. Once we have a series of nearly consecutive detections, we begin tracking each entity. In parallel, we estimate the field’s position relative to the camera and use it to project each entity’s position from frame coordinates to pitch coordinates. Finally, by identifying each player and assigning him to a team, we can keep track of his performance.
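The per-frame loop just described can be sketched as follows. Every component here (detector, homography, tracker) is a toy stand-in, not the real model; only the data flow mirrors the actual system:

```python
def detect_entities(frame):
    # Stand-in detector: a real one would return boxes for
    # players, the ball, and the referee. Here a "frame" is just
    # a list of (x, y) centers.
    return [{"center": (x, y), "label": "player"} for x, y in frame]

def estimate_field_homography(frame):
    # Stand-in: identity mapping from frame to pitch coordinates.
    # A real system would estimate this from pitch landmarks.
    return lambda pt: pt

def assign_track_id(detection, fallback_id):
    # Stand-in tracker: a real one would match detections
    # against existing tracks (e.g. by IoU or appearance).
    return fallback_id

def process_match(frames):
    tracks = {}  # track id -> list of pitch positions
    for frame in frames:
        to_pitch = estimate_field_homography(frame)
        for i, det in enumerate(detect_entities(frame)):
            tid = assign_track_id(det, i)
            tracks.setdefault(tid, []).append(to_pitch(det["center"]))
    return tracks
```

The point of the skeleton is the ordering: detection and homography estimation happen per frame, while tracking accumulates state across frames.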
Following that, we simply repeat this frame by frame until the end of the video. We then proceed to a phase we refer to as smoothing: looking back over the knowledge we extracted frame by frame, we apply what we call “backward adjustments” to make detections and trajectories consistent across the whole sequence.
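As one concrete (hypothetical) form of backward adjustment, a centred moving average over a completed trajectory revises each position using its neighbours on both sides, something a purely online, frame-by-frame pass cannot do:

```python
import numpy as np

def smooth_trajectory(points, window=5):
    """Centred moving average over an (N, 2) trajectory.
    `window` must be odd so the average is symmetric in time."""
    pts = np.asarray(points, dtype=float)
    kernel = np.ones(window) / window
    # Pad with edge values so endpoints aren't dragged toward zero.
    pad = window // 2
    padded = np.pad(pts, ((pad, pad), (0, 0)), mode="edge")
    return np.column_stack([
        np.convolve(padded[:, d], kernel, mode="valid")
        for d in range(pts.shape[1])
    ])
```

A Kalman smoother or spline fit would be more principled, but even this simple filter removes most per-frame detection jitter from a player’s path.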
Let’s now follow, step by step, what happens from the moment a frame is fed into the system.
Detecting Objects
When considering a problem like this from an ML perspective, the first thing you might notice is that labeled data of decent quality is hard to find. We can therefore turn to one of the most famous object detectors on the market: YOLOv3.
If you simply feed the frame to the pre-trained network and expect good results, you will be disappointed. Since accuracy was prioritized far above speed here, we instead ran YOLO over the image at its original resolution using a sliding window of crops. With this method, the network can consistently distinguish between players, referees, and the ball.
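A sketch of the sliding-window inference follows. The tile size, stride, and detector interface are illustrative, not the article’s actual parameters, and overlapping tiles produce duplicate boxes that a real pipeline would merge with non-maximum suppression:

```python
import numpy as np

def sliding_window_detect(image, detector, tile=416, stride=320):
    """Run `detector` on overlapping crops at native resolution and map
    the resulting boxes back to full-image coordinates.
    `detector(crop)` stands in for a YOLO forward pass on one crop and
    returns (x, y, w, h, label) tuples in crop coordinates."""
    h, w = image.shape[:2]
    detections = []
    for y0 in range(0, max(h - tile, 0) + 1, stride):
        for x0 in range(0, max(w - tile, 0) + 1, stride):
            crop = image[y0:y0 + tile, x0:x0 + tile]
            for (x, y, bw, bh, label) in detector(crop):
                # Shift the box from crop coordinates to image coordinates.
                detections.append((x + x0, y + y0, bw, bh, label))
    return detections
```

Keeping the crops at the network’s native input size is what preserves small objects like the ball, which would otherwise shrink to a few pixels if the whole frame were downscaled at once.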