Can science predict the results of football?

A predictive method makes it possible to forecast the results of future soccer matches. The model is based on machine learning, a discipline of Artificial Intelligence.

Sport is a physical activity that people perform mainly for recreation and/or to improve their physical condition. Throughout history, sport has had a great influence on culture and on the construction of national identity. It also has a positive impact on a country's development in areas such as education, the economy and public health.

Given the attention that certain sports receive in society, sport has become one of the world's great businesses and an important contributor to economic growth. Many experts agree that soccer is the most popular sport in the world: it moves large sums of money through betting, sponsorships, match attendance, and sales of shirts and accessories. This has awakened great interest in building predictive and statistical models.

In addition, the large amount of information available, such as match results, investments made and player characteristics, makes it possible to find mechanisms that provide competitive advantages.

The proposal
Several factors affect the outcome of a soccer game, such as the morale and skills of the team and/or its players, the training strategy and the equipment. All this makes predicting the outcome of a game complex, even for experts in the area.

It also raises interesting research questions for anyone seeking a general prediction model that can be applied to any soccer game. For example: What impact do the rules of each league have on the result? Is predicting a regular league game the same as predicting a league tournament game or an international tournament game? How is it possible to achieve a good prediction knowing only the results of previous games?

To answer these questions, we proposed predicting the results of soccer games in 52 leagues around the world using about 200 thousand results of previous regular league matches (league tournaments are not included). The same prediction model would be used to predict the group stage of the FIFA World Cup.

The model is built with machine learning. This discipline in the field of Artificial Intelligence creates systems that learn automatically. Learning in this context means identifying complex patterns in large volumes of data. Once those patterns have been identified, it is possible to build prediction or classification models.

In this proposal we use a model supported by two axes: the first is a Bayesian model based on the ranking of each team, and the second is based on the history shared by the two teams facing each other.

Team ranking
The ranking is computed once the season has been completed. For this, the FIFA ranking method is used, combining the totals of goals scored and conceded throughout the season.

When a new team joins the league, it may pay an adaptation price. To balance the scores, the points history that veteran teams bring with them is taken into account: the score of a veteran team is 20% of its previous score plus 80% of its current score.
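The 20%/80% weighting can be written as a one-line blend. The following is a minimal sketch; the function name and the handling of a team with no previous score (it simply keeps its current score) are assumptions made for illustration, not details given in the original work.

```python
from typing import Optional


def blended_score(current: float, previous: Optional[float] = None,
                  previous_weight: float = 0.2) -> float:
    """Blend a team's previous-season and current-season scores.

    A veteran team gets 20% of its previous score plus 80% of its
    current score. A new team (previous is None) is assumed to keep
    its current score unchanged; that choice is hypothetical.
    """
    if previous is None:
        return current
    return previous_weight * previous + (1.0 - previous_weight) * current


veteran = blended_score(current=60.0, previous=50.0)   # 0.2*50 + 0.8*60
newcomer = blended_score(current=60.0)                 # no history to blend
```

With these illustrative numbers, a veteran whose previous season was weaker than its current one ends up slightly below a newcomer with the same current score.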

The model
The Bayesian model uses each team's position in the ranking to obtain a probability of success or failure. Random variables drawn from a triangular distribution then provide an adjustment that is used to recalculate the probability of winning or losing. When the difference between the two teams' probabilities is less than 10%, the result is declared a tie.
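The adjustment and the tie rule can be illustrated as follows. This is only a sketch under stated assumptions: the text does not say how the ranking is turned into a probability or how the triangular adjustment is applied, so the additive shift, the `spread` parameter and both function names are hypothetical. Only the 10% tie threshold comes from the description above.

```python
import random


def adjust_probability(p_home, spread=0.05, rng=None):
    """Shift the home-win probability by a triangular random variable.

    Assumption: the adjustment is additive, symmetric around zero and
    bounded by `spread`. The original work does not specify this form.
    """
    rng = rng or random.Random()
    shift = rng.triangular(-spread, spread, 0.0)   # low, high, mode
    p = min(1.0, max(0.0, p_home + shift))          # keep within [0, 1]
    return p, 1.0 - p


def predict_outcome(p_home, p_away, tie_margin=0.10):
    """Declare a tie when the two probabilities differ by less than 10%."""
    if abs(p_home - p_away) < tie_margin:
        return "draw"
    return "home" if p_home > p_away else "away"


p_home, p_away = adjust_probability(0.60, rng=random.Random(42))
result = predict_outcome(p_home, p_away)
```

Passing a seeded `random.Random` instance keeps the sketch reproducible, which is useful when comparing runs of the same forecast.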

The prediction also takes into account the history shared by the two teams. The analysis of the historical data showed that, in some leagues, teams that face each other frequently develop result patterns that are independent of their positions in the ranking.

In this way, the complete model considers this pattern of behaviour together with the teams' rankings to determine the probability of winning, losing or drawing a match.
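One simple way to picture this combination is to estimate outcome frequencies from past meetings and mix them with the ranking-based probabilities. The sketch below is an assumption for illustration: the smoothing, the linear mix and the `h2h_weight` value are not taken from the original work, which does not state how the two sources are combined.

```python
from collections import Counter

OUTCOMES = ("win", "draw", "loss")   # from the first team's point of view


def head_to_head_probs(past_results, smoothing=1.0):
    """Estimate outcome probabilities from past meetings of two teams.

    Additive smoothing (an assumption) avoids zero probabilities when
    the two teams have met only a few times.
    """
    counts = Counter(past_results)
    total = sum(counts[o] for o in OUTCOMES) + smoothing * len(OUTCOMES)
    return {o: (counts[o] + smoothing) / total for o in OUTCOMES}


def combine_probs(rank_probs, h2h_probs, h2h_weight=0.3):
    """Linear mix of ranking-based and head-to-head probabilities."""
    return {o: (1.0 - h2h_weight) * rank_probs[o] + h2h_weight * h2h_probs[o]
            for o in OUTCOMES}


history = ["win", "win", "draw"]                      # hypothetical meetings
ranking_based = {"win": 0.5, "draw": 0.3, "loss": 0.2}
combined = combine_probs(ranking_based, head_to_head_probs(history))
```

Because both inputs are probability distributions and the weights sum to one, the combined values also sum to one.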

The forecast
To make the forecast for the regular leagues, a database with approximately 200 thousand results of soccer games from 52 leagues around the world was used. The database contains the season, league, date of the match, home team, visiting team and match result. For each league, a prediction was made for the matchday following the last date recorded in the database.

To make the forecast for the World Cup, the FIFA score list of each qualified team was used, as well as the allocation of teams to groups and the dates of the matches. The forecast was made for the group stage, in which three rounds of matches are played between the members of each group. The two teams with the best scores in each group qualify for the next phase.
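Turning predicted match outcomes into predicted qualifiers can be sketched as follows. The example assumes the usual points system (3 for a win, 1 for a draw) and ignores tie-breakers such as goal difference; the team names and outcomes are invented for illustration.

```python
def group_qualifiers(teams, predicted_matches, n_qualify=2):
    """Return the teams with the most points in a group.

    predicted_matches holds (home, away, outcome) tuples, where outcome
    is "home", "draw" or "away". Ties on points keep the input order of
    `teams`, because Python's sort is stable; real tie-breakers are not
    modelled here.
    """
    points = {team: 0 for team in teams}
    for home, away, outcome in predicted_matches:
        if outcome == "home":
            points[home] += 3
        elif outcome == "away":
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    ranked = sorted(teams, key=lambda team: points[team], reverse=True)
    return ranked[:n_qualify]


# Hypothetical group of four teams and six predicted results.
teams = ["A", "B", "C", "D"]
matches = [("A", "B", "home"), ("C", "D", "draw"),
           ("A", "C", "home"), ("B", "D", "home"),
           ("A", "D", "draw"), ("B", "C", "away")]
qualified = group_qualifiers(teams, matches)
```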

The results
To measure the effectiveness of the prediction, the Ranked Probability Score (RPS) was used. This measure penalizes forecasts more severely when their probabilities are farther from the actual result. Its value lies between 0 and 1, with zero being the best possible value.
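For a single match with outcomes ordered as home win, draw, away win, the RPS compares the cumulative forecast probabilities with the cumulative observed outcome. The sketch below follows the standard definition of the RPS; the function name and argument layout are illustrative and are not taken from the original work.

```python
def ranked_probability_score(probs, outcome_index):
    """RPS for one forecast over ordered outcomes.

    probs         : probabilities in outcome order, e.g. (home, draw, away)
    outcome_index : position of the outcome that actually occurred
    """
    n = len(probs)
    cum_forecast = 0.0
    cum_observed = 0.0
    total = 0.0
    # The last cumulative term is always 1 - 1 = 0, so it is skipped.
    for i in range(n - 1):
        cum_forecast += probs[i]
        cum_observed += 1.0 if i == outcome_index else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (n - 1)


perfect = ranked_probability_score((1.0, 0.0, 0.0), 0)   # forecast was right
worst = ranked_probability_score((1.0, 0.0, 0.0), 2)     # forecast was opposite
typical = ranked_probability_score((0.5, 0.3, 0.2), 0)
```

A certain and correct forecast scores 0, while a certain forecast of a home win followed by an away win scores 1. Averaging the per-match values gives a figure comparable to the averages reported in this article.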

Another measure used is the absolute accuracy of the forecast, that is, the percentage of matches whose predicted result matched the actual result.
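Accuracy reduces to counting matching labels. A minimal sketch, with a hypothetical function name and made-up example labels:

```python
def accuracy(predicted, actual):
    """Fraction of matches whose predicted result equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one match is required")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(predicted)


# Hypothetical labels for four matches; two predictions are correct.
share_correct = accuracy(["home", "draw", "away", "home"],
                         ["home", "away", "away", "draw"])
```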

The following graph shows the prediction results by league: each circle is the average result obtained for one league, and its size indicates the number of predictions made. The average RPS obtained is 0.2620, while the accuracy is 46%. These results are competitive with the state of the art. As can be seen, most of the results tend to be in the upper left of the graph, which indicates greater accuracy and little prediction error.

The results for the World Cup are shown in the following graph. The bars indicate the average accuracy obtained in the predictions for the three rounds of games per group; the average is 0.48. The orange line indicates the RPS obtained for the match predictions, with an average of 0.276. Finally, the blue line indicates the accuracy for the teams selected for phase 2, with an average of 0.68.

In conclusion
The main motivation for this work was the opportunity to test forecasting models on a topic as popular as soccer. Despite our lack of knowledge about football in general, we were able first to understand the challenge and then to develop a prediction model that is easy to implement.

Each league is driven by different motivations that influence the outcome of a game. This can hinder the recognition of patterns when only the results of previous matches are known; nevertheless, the model is able to recognize patterns that are useful for prediction.

In this development, most of the time was invested in defining the best way to classify and organize the data, and in programming the procedures as efficiently as possible.

The proposed methodology is simply an instance of a more general framework applied to football. Although, in principle, the framework can be adapted to a wide range of sports domains, it cannot be used in domains that have insufficient data.

Another approach to explore in the future is a knowledge-based system. Such a system usually requires knowledge of relatively good quality, whereas most machine learning systems need a large amount of data to produce good predictions.

It is important to understand that each football league behaves according to its particular environment. Therefore, a better prediction model should include particular characteristics of each game, such as its importance. This could help improve prediction accuracy.

Future work in this area includes the development of a model that tries to predict the score of the match, along with more advanced techniques and the use of different parameters to evaluate the quality of the result.