Makridakis Competitions


The Makridakis Competitions are a series of open competitions organized by teams led by forecasting researcher Spyros Makridakis and intended to evaluate and compare the accuracy of different forecasting methods.

Competitions

Summary

No. | Informal name for competition | Year of publication of results | Number of time series used | Number of methods tested | Other features
1 | M Competition or M-Competition | 1982 | 1001 | 15 | Not real-time
2 | M-2 Competition or M2-Competition | 1993 | 29 | 16 plus 2 combined forecasts and 1 overall average | Real-time, many collaborating organizations, competition announced in advance
3 | M-3 Competition or M3-Competition | 2000 | 3003 | 24 |
4 | M-4 Competition or M4 Competition | Initial results 2018, final 2020 | 100,000 | All major ML and statistical methods tested | First winner: Slawek Smyl, Uber Technologies
5 | M-5 Competition or M5 Competition | 2020 | Around 100,000 hierarchical time series | All major forecasting methods, including machine learning, deep learning and statistical ones | Prizes for the winners

First competition in 1982

The first Makridakis Competition, held in 1982, and known in the forecasting literature as the M-Competition, used 1001 time series and 15 forecasting methods. According to a later paper by the authors, the following were the main conclusions of the M-Competition:
  1. Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.
  2. The relative ranking of the performance of the various methods varies according to the accuracy measure being used.
  3. The accuracy when various methods are combined outperforms, on average, the individual methods being combined and does very well in comparison to other methods.
  4. The accuracy of the various methods depends on the length of the forecasting horizon involved.
The findings of the study have been verified and replicated through the use of new methods by other researchers.
In his paper "A brief history of time series forecasting competitions", Rob J. Hyndman said of the first M-Competition: “… anyone could submit forecasts, making this the first true forecasting competition as far as I am aware.”
Newbold was critical of the M-Competition and argued against the general idea of using a single competition to try to settle such a complex issue.

Before the first competition, the Makridakis–Hibon Study

Before the first M-Competition, Makridakis and Hibon published an article in the Journal of the Royal Statistical Society showing that simple methods perform well in comparison with more complex and statistically sophisticated ones. Statisticians at the time criticized the results, claiming that they could not be right. That criticism motivated the subsequent M, M2 and M3 Competitions, which confirmed the findings of the Makridakis–Hibon study.

Second competition, published in 1993

The second competition, called the M-2 Competition or M2-Competition, was conducted on a larger scale. A call to participate was published in the International Journal of Forecasting, announcements were made at the International Symposium on Forecasting, and a written invitation was sent to all known experts on the various time series methods. The M2-Competition was organized in collaboration with four companies, included six macroeconomic series, and was conducted on a real-time basis. The data came from the United States. The results of the competition were published in a 1993 paper and were claimed to be statistically identical to those of the M-Competition.
The M2-Competition used far fewer time series than the original M-Competition. Whereas the original M-Competition had used 1001 time series, the M2-Competition used only 29: 23 from the four collaborating companies and 6 macroeconomic series. Data from the companies were disguised by a constant multiplier to preserve confidentiality. The purpose of the M2-Competition was to better simulate the conditions of real-world forecasting.
In addition to the published results, many of the participants wrote short articles describing their experience of participating in the competition and their reflections on what it demonstrated. Chris Chatfield praised the design of the competition, but felt that, despite the organizers' best efforts, forecasters still did not have the kind of inside access to the companies that they would have in real-world forecasting.
Fildes and Makridakis argued that, despite the evidence produced by these competitions, the implications continued to be ignored by theoretical statisticians.

Third competition, published in 2000

The third competition, called the M-3 Competition or M3-Competition, was intended to both replicate and extend the features of the M-competition and M2-Competition, through the inclusion of more methods and researchers and more time series. A total of 3003 time series was used. The paper documenting the results of the competition was published in the International Journal of Forecasting in 2000 and the raw data was also made available on the International Institute of Forecasters website. According to the authors, the conclusions from the M3-Competition were similar to those from the earlier competitions.
The collection included yearly, quarterly, monthly, daily, and other time series. To ensure that enough data were available to develop an accurate forecasting model, minimum thresholds were set for the number of observations: 14 for yearly series, 16 for quarterly series, 48 for monthly series, and 60 for other series.
Time series were in the following domains: micro, industry, macro, finance, demographic, and other. Below is the number of time series based on the time interval and the domain:
Time interval between successive observations | Micro | Industry | Macro | Finance | Demographic | Other | Total
Yearly | 146 | 102 | 83 | 58 | 245 | 11 | 645
Quarterly | 204 | 83 | 336 | 76 | 57 | 0 | 756
Monthly | 474 | 334 | 312 | 145 | 111 | 52 | 1428
Other | 4 | 0 | 0 | 29 | 0 | 141 | 174
Total | 828 | 519 | 731 | 308 | 413 | 204 | 3003

The five measures used to evaluate the accuracy of the forecasts were: symmetric mean absolute percentage error (sMAPE), average ranking, median symmetric absolute percentage error, percentage better, and median relative absolute error (RAE).
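As an illustration of two of these measures, the sketch below shows one common formulation of the symmetric MAPE and of the relative absolute error against a naive benchmark. It is not the competition's official evaluation code, and the function names and sample values are invented for illustration.

    # Hedged sketch: one common formulation of sMAPE and of the relative
    # absolute error (RAE) against a naive benchmark; not the official M3 code.
    import numpy as np

    def smape(actual, forecast):
        """Symmetric mean absolute percentage error, in percent."""
        actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
        return np.mean(200.0 * np.abs(actual - forecast) / (np.abs(actual) + np.abs(forecast)))

    def relative_absolute_error(actual, forecast, benchmark):
        """Per-horizon absolute error divided by the benchmark's absolute error."""
        actual, forecast, benchmark = (np.asarray(a, float) for a in (actual, forecast, benchmark))
        return np.abs(actual - forecast) / np.abs(actual - benchmark)

    # Toy example: a 6-step-ahead forecast compared with a naive benchmark
    # that simply repeats the last observed value (108).
    actual   = [112, 118, 132, 129, 121, 135]
    forecast = [110, 120, 128, 131, 125, 130]
    naive    = [108] * 6
    print(round(smape(actual, forecast), 2))                            # about 2.5
    print(np.median(relative_absolute_error(actual, forecast, naive)))  # median RAE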
A number of other papers have been published with different analyses of the data set from the M3-Competition. According to Rob J. Hyndman, Editor-in-Chief of the International Journal of Forecasting, “The M3 data have continued to be used since 2000 for testing new time series forecasting methods. In fact, unless a proposed forecasting method is competitive against the original M3 participating methods, it is difficult to get published in the IJF.”

Fourth competition, started on January 1, 2018 and ended on May 31, 2018

The M-Competitions have attracted great interest in both the academic world and among practitioners, providing objective evidence about the most appropriate ways of forecasting various variables of interest. The fourth competition, M4, was announced in November 2017. The competition started on January 1, 2018 and ended on May 31, 2018. Initial results were published in the International Journal of Forecasting on June 21, 2018.
The M4 extended and replicated the results of the previous three competitions, using a larger and more diverse set of time series to identify the most accurate forecasting methods for different types of predictions. It aimed to answer how forecasting accuracy can be improved and to identify the most appropriate methods for each case. To obtain precise and compelling answers, the M4 Competition used 100,000 real-life series and incorporated all major forecasting methods, including those based on Artificial Intelligence, as well as traditional statistical ones.
In his blog, Rob J. Hyndman said about M4: “The “M” competitions organized by Spyros Makridakis have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models. For that, Spyros deserves congratulations for changing the landscape of forecasting research through this series of competitions.”
Below is the number of time series based on the time interval and the domain:
Time interval between successive observations | Micro | Industry | Macro | Finance | Demographic | Other | Total
Yearly | 6538 | 3716 | 3903 | 6519 | 1088 | 1236 | 23000
Quarterly | 6020 | 4637 | 5315 | 5305 | 1858 | 865 | 24000
Monthly | 10975 | 10017 | 10016 | 10987 | 5728 | 277 | 48000
Weekly | 112 | 6 | 41 | 164 | 24 | 12 | 359
Daily | 1476 | 422 | 127 | 1559 | 10 | 633 | 4227
Hourly | 0 | 0 | 0 | 0 | 0 | 414 | 414
Total | 25121 | 18798 | 19402 | 24534 | 8708 | 3437 | 100000

To ensure that enough data were available to develop an accurate forecasting model, minimum thresholds were set for the number of observations: 13 for yearly, 16 for quarterly, 42 for monthly, 80 for weekly, 93 for daily and 700 for hourly series.
One of its major objectives was to compare the accuracy of ML methods versus that of statistical ones and empirically verify the claims of the superior performance of ML methods.
Below is a short description of the M4 Competition and its major findings and conclusion.
The M4 Competition ended on May 31, 2018 and, in addition to point forecasts, it required participants to specify prediction intervals (PIs). M4 was an open competition whose most important objective was “to learn to improve forecasting accuracy and advance the field as much as possible”. This is in contrast to other competitions, such as those organized by Kaggle, which are essentially a “horse race” aimed at identifying the most accurate forecasting method without attempting to discover the reasons for its accuracy so that forecasting performance can be improved in the future.
The organizers outlined what they consider to be the five major findings of the M4 Competition and drew a conclusion from them:
  1. The combination of methods was the king of the M4. Out of the 17 most accurate methods, 12 were “combinations” of mostly statistical approaches.
  2. The biggest surprise, however, was a “hybrid” approach utilizing both statistical and ML features. This method produced the most accurate forecasts as well as the most precise PIs, and was submitted by Slawek Smyl, a data scientist at Uber Technologies. According to sMAPE, it was close to 10% more accurate than the combination benchmark of the competition; in the M3 Competition, the best method was 4% more accurate than the same benchmark.
  3. The second most accurate method was a combination of seven statistical methods and one ML method, with the weights for the averaging calculated by a ML algorithm trained to minimize forecasting error through holdout tests. This method was jointly submitted by Spain’s University of A Coruña and Australia’s Monash University.
  4. The first and second most accurate methods also achieved notable success in correctly specifying the 95% PIs. According to the organizers, these are the first methods they are aware of that have done so without considerably underestimating uncertainty.
  5. The six pure ML methods submitted to the M4 performed poorly: none of them was more accurate than the combination benchmark (Comb) and only one was more accurate than Naïve2. These results are in agreement with those of a study the organizers had published in PLOS ONE at the end of March 2018.
The conclusion the organizers drew from these findings is that the accuracy of individual statistical or ML methods is low, and that hybrid approaches and combinations of methods are the way forward to improve forecasting accuracy and make forecasting more valuable.
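To make the combination idea concrete, here is a minimal Python sketch, not any competitor's actual method: it averages, with equal weights, the forecasts of three simple benchmark methods, which is the simplest form of the combinations that dominated the M4 results. The component methods and the smoothing parameter are illustrative choices.

    # Minimal illustration of forecast combination: equal-weight average of three
    # simple benchmark methods (naive, drift, simple exponential smoothing).
    import numpy as np

    def naive_forecast(y, h):
        """Repeat the last observation h steps ahead."""
        return np.full(h, y[-1], dtype=float)

    def drift_forecast(y, h):
        """Extend the average historical change (random walk with drift)."""
        slope = (y[-1] - y[0]) / (len(y) - 1)
        return y[-1] + slope * np.arange(1, h + 1)

    def ses_forecast(y, h, alpha=0.3):
        """Simple exponential smoothing with a fixed smoothing parameter."""
        level = y[0]
        for obs in y[1:]:
            level = alpha * obs + (1 - alpha) * level
        return np.full(h, level, dtype=float)

    def combined_forecast(y, h):
        """Equal-weight combination of the three component forecasts."""
        components = [naive_forecast(y, h), drift_forecast(y, h), ses_forecast(y, h)]
        return np.mean(components, axis=0)

    y = np.array([112.0, 118, 132, 129, 121, 135, 148, 148, 136, 119])
    print(combined_forecast(y, h=4))

Replacing the equal weights with weights learned on holdout data, as in the second most accurate M4 submission described above, is a natural extension of this sketch.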

Fifth competition, to start March 2, 2020 and end June 30, 2020

M5, the latest of the M Competitions, will run from March 2 to June 30, 2020. It will use real-life data from Walmart, will be run on Kaggle’s platform, and will offer substantial prizes totalling $100,000 to the winners. The data are provided by Walmart and consist of around 100,000 hierarchical daily time series, starting at the level of individual SKUs and ending with the total demand of a large geographical area. In addition to the sales data, there will also be information about prices, advertising/promotional activity and inventory levels, as well as the day of the week each observation refers to.
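To illustrate what "hierarchical" means in this context, the sketch below builds a small set of daily SKU-level series and aggregates them bottom-up to store, state and total level. The identifiers and figures are invented and are not the actual Walmart data.

    # Hypothetical example of hierarchical daily sales series: bottom-level
    # SKU-per-store series are summed upward to store, state and total demand.
    import pandas as pd

    sales = pd.DataFrame({
        "state": ["TX", "TX", "TX", "TX", "CA", "CA"],
        "store": ["TX_1", "TX_1", "TX_1", "TX_1", "CA_3", "CA_3"],
        "sku":   ["A1", "A1", "B7", "B7", "A1", "A1"],
        "date":  pd.to_datetime(["2020-03-02", "2020-03-03"] * 3),
        "units": [3, 5, 2, 0, 7, 4],
    })

    # Bottom level: one daily series per state/store/SKU combination.
    bottom = sales.groupby(["state", "store", "sku", "date"])["units"].sum()

    # Higher levels are obtained simply by summing the bottom-level series.
    per_store = bottom.groupby(level=["state", "store", "date"]).sum()
    per_state = bottom.groupby(level=["state", "date"]).sum()
    total     = bottom.groupby(level="date").sum()
    print(total)  # total daily demand across all stores and SKUs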
There will be several major prizes for the first, second and third place winners in each of the competition’s prize categories.
There will also be student and company prizes. There will be no limit to the number of prizes that can be won by a single participant or team.
The focus of the M5 is mainly on practitioners rather than academics. Makridakis expects that the M5 Competition will attract more than 2,000 participants and teams, given the substantial prize money and public interest.

M5 Conference

Following the M5 Competition, an M5 Forecasting Conference will be held in New York in December 2020, where the competition’s findings will be presented, together with descriptions of the most accurate methods and the winning firms, and suggestions on how the lessons learned can be applied in other firms. Finally, there will also be a special issue of the International Journal of Forecasting devoted exclusively to the M5 Competition and Conference, focusing on how what has been learned can be disseminated and applied to as wide an audience as possible. In addition to papers describing the best methods, there will also be articles from practitioners and academics, commentaries, and suggestions on how future competitions can be improved.
References
More information on the M4 Competition is available on the M4 website (http://www.m4.unic.ac.cy), and a special issue covering all aspects of the M4, the winning methods and commentaries will be published in the International Journal of Forecasting in 2019.

Offshoots

NN3-Competition

Although the organizers of the M3-Competition contacted researchers in the area of artificial neural networks to seek their participation, only one researcher took part, and that researcher's forecasts fared poorly. The reluctance of most ANN researchers to participate at the time was due to the computationally intensive nature of ANN-based forecasting and the large number of time series used in the competition. In 2005, Crone, Nikolopoulos and Hibon organized the NN3-Competition, using 111 of the time series from the M3-Competition. The NN3-Competition found that the best ANN-based forecasts performed comparably with the best known forecasting methods but were far more computationally intensive. It was also noted that many ANN-based techniques fared considerably worse than simple forecasting methods, despite their greater theoretical potential for good performance.

Reception

In books for mass audiences

Nassim Nicholas Taleb, in his book The Black Swan, references the Makridakis Competitions as follows: "The most interesting test of how academic methods fare in the real world was provided by Spyros Makridakis, who spent part of his career managing competitions between forecasters who practice a "scientific method" called econometrics -- an approach that combines economic theory with statistical measurements. Simply put, he made people forecast in real life and then he judged their accuracy. This led to a series of "M-Competitions" he ran, with assistance from Michele Hibon, of which M3 was the third and most recent one, completed in 1999. Makridakis and Hibon reached the sad conclusion that "statistically sophisticated and complex methods do not necessarily provide more accurate forecasts than simpler ones.""
In the book Everything is Obvious, Duncan Watts cites the work of Makridakis and Hibon as showing that "simple models are about as good as complex models in forecasting economic time series."