M5 Forecasting Accuracy: Top 10% Solution

Siddharth Pathania
Mar 14, 2021


Overview :-

M5 Forecasting Accuracy is a competition hosted by Kaggle, with the dataset made available by Walmart. In this competition, we have to forecast future sales of each product in each store based on the hierarchical sales data provided by Walmart. We use this dataset for EDA (Exploratory Data Analysis) and for training Machine Learning (ML) models. These ML models are then used to predict future sales. The next step is to evaluate how each model performed by submitting its predicted values to Kaggle. After submitting, a score is given based on the metric used in the competition to evaluate the results. The key goal is to score in the top 10%.
After a couple of iterations (typical with model building), my model scored in the top 10%. I am sharing my approach and some useful learnings from this exercise.

Steps :-

  1. Define the problem.
  2. Fetching the data.
  3. Data Preprocessing.
  4. Data Analysis.
  5. Feature Engineering.
  6. Train different Machine Learning models.
  7. Predict the values from every model separately.
  8. Submit the results and see which model gives the best score.
  9. Improve your model again to get better scores.
  10. Future improvements :- Use Deep Learning.

Let’s shed some light on these steps :-

1. Define the problem :-

M5 Forecasting Accuracy is a competition in which we have to forecast future sales of each product in each store based on the hierarchical sales data provided by Walmart. In this competition we have to forecast daily sales for the next 28 days. We have data for 3 states in the US (California, Texas, and Wisconsin). The data files (.csv files) provided for the competition cover item-level sales, departments, product categories, units sold per day, store details, prices, promotions, day of the week, and special events. Using this data, we will forecast daily sales for the next 28 days as accurately as possible.

[Figure: an overview of how the M5 data is organized]

2. Fetching the data :-

Now we will get the data. We can either download it directly or use the API provided in the Data section of the competition. Since I did all my work in Google Colab, I used the API to get the data.
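
A minimal sketch of fetching the data with the Kaggle API in a Colab notebook (this assumes you have already created a Kaggle API token, kaggle.json, and accepted the competition rules):

```python
# Minimal sketch: download the competition files in Colab using the Kaggle API.
# Assumes kaggle.json (your API token) has been uploaded to the working directory.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle competitions download -c m5-forecasting-accuracy
!unzip -q m5-forecasting-accuracy.zip -d data
```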


3. Data Preprocessing :-

In this step we will look at the data and convert it into a format that we can use for further processing.

Here we have 5 .csv files provided by the competition.


We will read three of these files (sales, sell prices, and calendar) as dataframes in our .ipynb.
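
For instance (file names follow the competition's data page, and the data/ path comes from the download step above; sales_train_validation.csv works the same way for the first phase of the competition):

```python
import pandas as pd

sales = pd.read_csv("data/sales_train_evaluation.csv")
prices = pd.read_csv("data/sell_prices.csv")
calendar = pd.read_csv("data/calendar.csv")
```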

Then we define a function called downcast which will help us save memory; we apply it to each dataframe. We also add new columns to sales so that we can build the test dataset later on.
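
A minimal sketch of such a downcast function (my actual implementation may differ slightly; the dataframe names sales, prices, and calendar are the ones read above):

```python
import numpy as np
import pandas as pd

def downcast(df):
    """Shrink numeric columns to the smallest dtype that fits and
    convert object columns to category, to reduce memory usage."""
    for col in df.columns:
        dtype = df[col].dtype
        if pd.api.types.is_integer_dtype(dtype):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(dtype):
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif dtype == "object":
            df[col] = df[col].astype("category")
    return df

sales, prices, calendar = downcast(sales), downcast(prices), downcast(calendar)

# Add the 28 future days as empty columns so the test set can be built later.
for d in range(1942, 1970):
    sales[f"d_{d}"] = np.nan
```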

Now we will change the format of sales: we will convert it from wide to long format using melt.


Wide :- d_1, d_2, … are columns.

Long :- d_1, d_2, … are rows.

This conversion is done so that sales can be merged with sell prices and calendar. After merging, we get a single dataframe on which we will perform data analysis.
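
A rough sketch of the melt and the merges (the column names follow the competition files; the exact code in my notebook may differ):

```python
# Melt the day columns (d_1, d_2, ...) so each row is one item-store-day.
id_cols = ["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"]
df = sales.melt(id_vars=id_cols, var_name="d", value_name="sold")

# Merge with the calendar on the day key, then with the sell prices.
df = df.merge(calendar, on="d", how="left")
df = df.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="left")
```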

4. Data Analysis :-

In this step I performed data analysis and data visualization to get some insights from the data.

Here are a few examples of graphs that I plotted, with their observations :-

[Graph 1]
[Graph 2]
[Graph 3]

5. Feature Engineering :-

This is one of the most creative processes in machine learning. In this process we try to come up with new features from the existing ones. Not all of the new features are helpful, so we have to find the useful features that will help the model perform better.

There are many techniques used in feature engineering. Some of the techniques that I used are listed below, with a small sketch after the list :-

Lags
Rolling / Sliding Window
Expanding Window
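
A minimal sketch of these three feature types on the long dataframe (the sold and id columns come from the melt above; the 28-day shift is one common way to avoid leaking the forecast horizon, not necessarily the exact choice in my notebook):

```python
# Group by each item-store series so features never mix different series.
grp = df.groupby("id")["sold"]

# Lag feature: sales shifted back by a fixed number of days.
df["lag_28"] = grp.shift(28)

# Rolling / sliding window: statistics over a fixed window of past days.
df["rolling_mean_7"] = grp.transform(lambda s: s.shift(28).rolling(7).mean())

# Expanding window: statistics over all days seen so far.
df["expanding_mean"] = grp.transform(lambda s: s.shift(28).expanding().mean())
```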

After this we will get a dataframe which we will use to train the models.

6. Train different Machine Learning models :-

In this step I used different models, and also different techniques while training them, to get predictions for the validation and test datasets.

I created a dataframe df_final which consists of a subset of the features/columns. These columns were chosen after many iterations of training and submitting results.

Then I created the validation and test datasets, which I will use to predict the values for submission.

After that comes the part where I train the models using this data. Data for days with ‘d’ < 1942 is used for training.
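
A rough sketch of that split (the 28-day validation and test windows here follow the standard M5 setup; the column names are from my df_final):

```python
# Parse the numeric day index from the 'd' column, e.g. 'd_1941' -> 1941.
df_final["d_num"] = df_final["d"].astype(str).str[2:].astype(int)

train = df_final[df_final["d_num"] < 1942]                 # training history
valid = df_final[df_final["d_num"].between(1914, 1941)]    # 28-day validation window (also submitted)
test  = df_final[df_final["d_num"].between(1942, 1969)]    # the 28 days to forecast
```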

[Figure: models and techniques that I used]

7. Predict the values from every model separately :-

In this step we will predict the values for the validation and test data.

8. Submit the results and see which model gives the best score :-

In this step we convert the predictions from each model into a format in which we can submit them. For every model this step is carried out separately.
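
A rough sketch of reshaping the test predictions into the submission layout (id plus F1 to F28, as in the sample submission file; the pred column name is just for this sketch):

```python
# 'test' has one row per id per day; 'pred' holds that model's predictions.
sub = test.assign(F="F" + (test["d_num"] - 1941).astype(str))
sub = sub.pivot(index="id", columns="F", values="pred").reset_index()
sub = sub[["id"] + [f"F{i}" for i in range(1, 29)]]

# The full file also needs the matching *_validation rows, built the same
# way from the validation-window predictions.
sub.to_csv("submission.csv", index=False)
```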

After submitting the predictions we get a score, which is actually an error. This error is calculated using the metric Weighted Root Mean Squared Scaled Error (WRMSSE). The lower the score, the better the model’s performance.
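
For intuition, a minimal sketch of RMSSE for a single series; WRMSSE then roughly weights each series’ RMSSE by its share of dollar sales in the last 28 days of the training data:

```python
import numpy as np

def rmsse(y_hist, y_true, y_pred):
    """Root Mean Squared Scaled Error for one series.
    y_hist: the training history; y_true / y_pred: the 28-day horizon."""
    # Numerator: mean squared forecast error over the horizon.
    num = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    # Denominator: mean squared one-step naive error over the history.
    den = np.mean(np.diff(np.asarray(y_hist)) ** 2)
    return np.sqrt(num / den)

# WRMSSE = sum over all series of (weight_i * rmsse_i),
# where weight_i is proportional to that series' recent dollar sales.
```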

9. Improve your model again to get better scores :-

In this step we perform some hyperparameter tuning and try to remove/add some features to see if we can get a better score.

Best Model :-

The training technique that gave me the best result was an LGBM Regressor with CV (cross-validation) folds. Here I used a cross-validation scheme from Scikit-Learn called TimeSeriesSplit, which is designed for time-series forecasting data: each fold trains on the past and validates on the period that follows it.
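
A minimal sketch of what that looks like (the hyperparameters and the tweedie objective here are illustrative, not my exact settings; train and the sold column come from the earlier steps, and rows are assumed to be sorted by day with numeric or category features only):

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit

features = [c for c in train.columns if c not in ("id", "d", "d_num", "sold")]
X, y = train[features], train["sold"]

tscv = TimeSeriesSplit(n_splits=3)   # each fold trains on the past, validates on what follows
models = []
for train_idx, valid_idx in tscv.split(X):
    model = LGBMRegressor(
        n_estimators=1000,
        learning_rate=0.05,
        objective="tweedie",         # illustrative choice for sparse, non-negative sales
    )
    model.fit(
        X.iloc[train_idx], y.iloc[train_idx],
        eval_set=[(X.iloc[valid_idx], y.iloc[valid_idx])],
    )
    models.append(model)

# The final forecast can average the predictions of the fold models.
```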

[Figures: Scikit-Learn’s TimeSeriesSplit, the LGBM Regressor with CV folds, and the best result]

10. Future improvements :- Use Deep Learning :-

We could also use Deep Learning techniques, such as an LSTM, for better results.
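
A tiny illustrative sketch (not part of this solution) of a Keras LSTM that maps the last 28 days of a series to the next 28:

```python
import tensorflow as tf

window = 28   # length of the input window; an arbitrary choice for this sketch
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, 1)),  # one value per past day
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(28),                 # predict the next 28 days at once
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_windows, y_next_28, epochs=..., batch_size=...)
```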

Links :-

Competition link :- https://www.kaggle.com/c/m5-forecasting-accuracy

To see the full code :- https://github.com/sidd1196/M5_Forecasting_Accuracy_Kaggle_competition

https://www.appliedaicourse.com/
