M5 Forecasting — Accuracy

Dipanshu Rana
May 31, 2021


Estimate the unit sales of Walmart retail goods

Table of Contents:

  1. Business Problem
  2. Source of Data
  3. Data Overview
  4. Objective
  5. Performance Metrics
  6. Why ML Models Work Better than Classical Methods
  7. Exploratory Data Analysis
  8. First Cut Solution
  9. Feature Engineering and Data Preprocessing
  10. ML Models Approach
  11. Models Comparison and Kaggle Score
  12. Final Pipeline
  13. Deployment
  14. Future Work
  15. References

1. Business Problem

Department stores like Walmart handle countless products and money transactions every day. Because of this rapid transaction rate, keeping the balance between inventory and customer demand is critical. Accurate sales predictions for different products are therefore essential for stores to optimize profits.

Previous studies on market sales prediction require a lot of extra information, such as customer and product analysis. Department stores therefore need a simpler model that predicts product sales based only on the historical sales record.

2. Source of Data

It is a Kaggle competition organized by the University of Nicosia. The data is available here.

3. Data Overview

In this case study we use hierarchical unit sales data from Walmart, which includes the price per product, item level, department, product category, and store details. The data covers stores in three US states: California, Texas, and Wisconsin. Additional data is also provided, including promotions, day of the week, and special events.

Files:

  1. calendar.csv : Contains information about the dates on which the products are sold.
  2. sales_train_validation.csv : Contains the historical daily unit sales data per product and store [d_1 to d_1913].
  3. sell_prices.csv : Contains information about the price of the products sold per store and date.
  4. sales_train_evaluation.csv : Contains the historical daily unit sales data per product and store [d_1 to d_1941].
  5. sample_submission.csv : The correct format for submissions.

4. Objective

Our main aim is to predict product sales for the next 28 days.

5. Performance Metrics

  • RMSE plays an important role when we care most about the magnitude of prediction errors.
  • RMSE does not treat every error the same: it gives more weight to the largest errors.
  • That means a single big error is enough to produce a bad RMSE.
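
As a quick illustration, here is a minimal pure-Python RMSE (library implementations were used in practice; this sketch just shows how squaring amplifies large errors):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: errors are squared before averaging,
    so a single large error dominates the score."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# A single miss of 10 on four points already gives RMSE = 5.0,
# while four misses of 1 each would give only RMSE = 1.0.
print(rmse([0, 0, 0, 10], [0, 0, 0, 0]))  # 5.0
print(rmse([1, 1, 1, 1], [0, 0, 0, 0]))   # 1.0
```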

6. Why ML Models Work Better than Classical Methods

  • It has been found that classical methods may dominate for one-step forecasting.
  • ML models, on the other hand, tend to work better for multi-step forecasting.
  • Since we are doing multi-step forecasting here, ML models are the better choice.

7. Exploratory Data Analysis

7.1. Calendar

  • About 8% of days have a special event, while the remaining 92% have none.
  • The percentage of days on which SNAP purchases are allowed in Walmart stores is the same in each of the three states: 33%.
  • Sales fall from Monday to Wednesday, rise from Thursday to Saturday, and dip slightly on Sunday.
  • Sales are higher on weekends than on weekdays.
  • Sales rise from January to March and then fall until June; after June the rises and falls continue.
  • March, April, and May have the highest sales among all months.
  • Sales rise from 2011 to 2015, with a slight dip in 2014, then drop sharply in 2016.
  • Sales are highest in 2015 and lowest in 2016, but only because the 2016 data runs through May.
  • Data is available only until May 2016, and there are no sales after 23rd May.
  • Sales during sporting events are slightly higher than on non-event days.
  • Sales on national event days are lower than on other event days and on non-event days.
  • Among event days, sales are highest on sporting events and lowest on national events.
  • Days with multiple events have the highest sales, while days with no events have the lowest.
  • Overall, sales on event days are higher than on non-event days.

7.2. Department

  • From the subplots above we can observe that the last days of every year have zero sales, probably because the store is closed on Christmas Day.
  • Sales of FOODS_3 are the highest among all departments, with a huge gap between FOODS_3 and the rest.
  • Sales of HOUSEHOLD_1 and FOODS_2 rise gradually from 2011 to 2015, while sales of HOBBIES_1 and FOODS_1 overlap and stay almost constant over the same period.
  • Sales of HOBBIES_2 are the lowest among all departments and almost constant from 2011 to 2016.
  • Sales of every department appear to fall suddenly after 2015 because we do not have the whole of 2016.
  • Sales on Saturday and Sunday (weekends) are higher than on weekdays for every department.
  • HOBBIES_2 has the lowest sales and FOODS_3 the highest.
  • Sales on non-holiday days are higher than on holidays for every department.
  • In every state, sales on SNAP days are higher than on non-SNAP days, though the difference is not large.
  • Sales are almost the same across states, both on SNAP days and on non-SNAP days.

State CA :

  • FOODS_2, FOODS_3 : sales on SNAP days are higher than on non-SNAP days.
  • FOODS_1, HOBBIES_1, HOUSEHOLD_1, HOUSEHOLD_2 : sales on SNAP days are slightly higher than on non-SNAP days (not much difference).
  • HOBBIES_2 : sales on non-SNAP days are slightly higher than on SNAP days (almost the same).

State TX :

  • FOODS_1, FOODS_2, FOODS_3 : sales on SNAP days are higher than on non-SNAP days.
  • HOBBIES_1, HOBBIES_2, HOUSEHOLD_1, HOUSEHOLD_2 : sales on SNAP days are slightly higher than on non-SNAP days (not much difference).

State WI :

  • FOODS_1, FOODS_2, FOODS_3 : sales on SNAP days are higher than on non-SNAP days.
  • HOBBIES_1, HOBBIES_2, HOUSEHOLD_1, HOUSEHOLD_2 : sales on SNAP days are slightly higher than on non-SNAP days (not much difference).

7.3. Category

  • From the subplots above we can observe that the last days of every year have zero sales, probably because the store is closed on Christmas Day.
  • Sales of FOODS are the highest among the three categories, with a huge gap between FOODS and either HOBBIES or HOUSEHOLD.
  • Sales of HOBBIES are the lowest of the three categories.
  • The difference in sales between HOBBIES and HOUSEHOLD is not large.
  • Sales of HOBBIES are almost constant from 2011 to 2015, while sales of HOUSEHOLD rise gradually over the same period.
  • Sales of every category appear to fall suddenly after 2015 (because we do not have the whole of 2016).
  • The FOODS category has the highest revenue and HOBBIES the lowest.
  • The FOODS category alone contributes approximately 60% of revenue.

7.4. Store

  • From the subplots above we can observe that the last days of every year have zero sales, probably because the store is closed on Christmas Day.
  • Sales of the CA_3 store are the highest among all stores, with a huge gap between CA_3 and the rest.
  • Sales of every store appear to fall suddenly after 2015 (because we do not have the whole of 2016).

7.5. State

  • Sales are highest in CA among the three states.
  • Sales of TX and WI almost overlap: before 2014 TX is higher than WI, while after 2014 WI becomes higher than TX.
  • Sales in every state rise until 2015 and then fall suddenly in 2016 (because we do not have the whole of 2016).
  • CA has the highest revenue and WI the lowest.
  • The revenues of WI and TX are almost the same (not much difference).

7.6. Item with Maximum Sales : FOODS_3_090

  • Sales of FOODS_3_090 increase from 2011 to 2012, decrease until 2014, and then remain constant from 2014 to 2015.
  • Sales of FOODS_3_090 decrease from 2015 to 2016 (because we do not have the whole of 2016).

Year 2011 :

  • From March to 22nd September, FOODS_3_090 sales are zero (the item may not have been available during this period).
  • After 22nd September, FOODS_3_090 sales increase again.

Year 2012 :

  • June has the lowest FOODS_3_090 sales.

Year 2013 :

  • November has the lowest FOODS_3_090 sales.
  • After 12th November, FOODS_3_090 sales become zero.

Year 2014 :

  • December has the lowest FOODS_3_090 sales; except for a few days, sales remain zero.
  • Some downturn in FOODS_3_090 sales is also seen in February; after 6th February sales become zero.
  • Some downturn is also seen in May; after 12th May sales become zero.

Year 2015 :

  • December has the lowest FOODS_3_090 sales; after 2nd December sales become zero.
  • Some downturn in FOODS_3_090 sales is also seen in February; after 12th February sales become zero.
  • Some downturn is also seen in May; after 13th May sales become zero.

Year 2016 :

  • No data is available after May; after 22nd May FOODS_3_090 sales become zero.

7.7. Sell Price

  • FOODS : the sell price is almost the same in every store and the lowest among all categories.
  • HOBBIES : the sell price is the same in every store and higher than in the FOODS category.
  • HOUSEHOLD : the sell price is the highest among all categories and occasionally spikes very high (above 100).

7.8. Revenue per Year

  • Revenue increases from 2011 to 2015.
  • Revenue decreases from 2015 to 2016, not because of lower sales but because of missing data (we only have data through May 2016).

8. First Cut Solution

Moving Average :

  • A moving average gives an overall idea of the trend in a data set: it is the average over a sliding subset of the values and can be computed for any period of time.
  • To forecast sales we tried different window sizes. With a window of 14 days, for example, the sales on a given day are predicted as the average of the sales over the last 14 days (the window is a hyper-parameter).
  • After hyper-parameter tuning, a window of 7 days gives the best performance.
  • For the best hyper-parameter (window = 7), RMSE score = 3.2802.
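
A minimal sketch of this baseline, using a recursive variant in which each prediction is appended to the history before forecasting the next day (whether the original first-cut solution recursed this way is an assumption):

```python
def moving_average_forecast(history, horizon=28, window=7):
    """Forecast `horizon` future days; each day is predicted as the mean
    of the previous `window` values, and predictions are fed back into
    the history so multi-step forecasts can be produced."""
    values = list(history)
    forecast = []
    for _ in range(horizon):
        pred = sum(values[-window:]) / window
        forecast.append(pred)
        values.append(pred)
    return forecast

# A constant history yields a constant forecast.
print(moving_average_forecast([3] * 14, horizon=3, window=7))  # [3.0, 3.0, 3.0]
```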

9. Feature Engineering and Data Preprocessing

Melting:

  • To make the data in the table easier to analyze, we can reshape it into a more computer-friendly form using pandas in Python. The pandas.melt() function unpivots a DataFrame from wide format to long format.
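
A toy example of the reshape (the column names mimic the competition files, but the values are made up):

```python
import pandas as pd

# Tiny wide-format frame shaped like sales_train_validation.csv:
# one row per item/store, one column per day.
wide = pd.DataFrame({
    "id": ["FOODS_3_090_CA_1", "HOBBIES_1_001_CA_1"],
    "d_1": [3, 0],
    "d_2": [5, 1],
})

# Unpivot: one row per (item, day) pair, with unit sales in 'demand'.
long = pd.melt(wide, id_vars=["id"], var_name="d", value_name="demand")
print(long.shape)  # (4, 3)
```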

Lags :

  • A lag is expressed in a time unit and corresponds to the amount of history we allow the model to use when making a prediction.
  • Here we applied lags to the ‘demand’ column; the maximum lag taken is 70 days.
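
Lag features are one line each with pandas shift (a small sketch with made-up demand values; the real pipeline builds lags up to 70 days):

```python
import pandas as pd

df = pd.DataFrame({"demand": [3, 5, 2, 7, 4]})

# shift(k) moves demand k rows down, so each row sees the sales from
# k days ago; the first k rows have no history and become NaN.
for lag in (1, 2):
    df[f"lag_{lag}"] = df["demand"].shift(lag)

print(df["lag_1"].tolist())  # [nan, 3.0, 5.0, 2.0, 7.0]
```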

Rolling Mean :

  • A rolling mean is computed over a sliding window of a specified size that moves through the data.
  • Here we compute rolling means on the ‘demand’ column; the maximum window size taken is 42.
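
A sketch with made-up values. Shifting by one day before rolling, so that a day's feature never includes that day's own demand, is a common leakage-avoidance convention and is assumed here:

```python
import pandas as pd

df = pd.DataFrame({"demand": [2, 4, 6, 8, 10]})

# shift(1) first so the 3-day window only covers past days, then
# average over the window as it rolls through the series.
df["rolling_mean_3"] = df["demand"].shift(1).rolling(window=3).mean()

print(df["rolling_mean_3"].tolist())  # [nan, nan, nan, 4.0, 6.0]
```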

Label Encoding :

  • Label encoding converts categorical labels into numeric, machine-readable form, so that ML algorithms can better decide how to operate on them.
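
For example, using pandas category codes (sklearn's LabelEncoder does the same job; the store ids here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"store_id": ["CA_1", "TX_2", "CA_1", "WI_3"]})

# Map each distinct label to an integer code; identical labels get
# identical codes, so the column becomes model-ready.
df["store_id_enc"] = df["store_id"].astype("category").cat.codes

print(df["store_id_enc"].tolist())  # [0, 1, 0, 2]
```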

10. ML Models Approach

  • Took the data after day 1000 (d_1000) so that processing is fast (approximately the last 31 months of data).
  • Divided the data into train/validation/test.
  • Train : up to d_1885, Validation : from d_1886 to d_1913, Test : from d_1914 to d_1941 (28 days each for validation and test).
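
The split is just boolean masking on a day-number column. The sketch below assumes an integer `day` column derived from the d_* labels, with 28-day validation and test windows:

```python
import pandas as pd

# One row per day from d_1000 to d_1941 (demand values omitted here).
df = pd.DataFrame({"day": range(1000, 1942)})

train = df[df["day"] <= 1885]
valid = df[(df["day"] >= 1886) & (df["day"] <= 1913)]
test  = df[(df["day"] >= 1914) & (df["day"] <= 1941)]

print(len(train), len(valid), len(test))  # 886 28 28
```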

10.1. Linear Regression

  • Linear regression is a supervised machine learning algorithm used to find a linear relationship between the target and one or more predictors.
  • RMSE score = 1.8829

10.2. Decision Tree Regressor

  • A decision tree regressor observes the features of the training data and fits a tree-structured model that predicts a meaningful continuous output for future data.
  • Best hyper-parameters max_depth=104, min_samples_split=278, min_samples_leaf=344 after hyper-parameter tuning.
  • For best hyper-parameters RMSE score = 1.8667

10.3. Random Forest Regressor

  • A random forest regressor is an ensemble technique that uses multiple decision trees together with bootstrap aggregation, commonly known as bagging.
  • The basic idea is to combine multiple decision trees in determining the final output rather than relying on an individual tree.
  • Best hyper-parameters max_depth=9, min_samples_leaf=2, n_estimators=77 after hyper-parameter tuning.
  • For best hyper-parameters RMSE score = 1.8512

10.4. XGBoost Regressor

  • XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.
  • Here we made a few changes: tree_method=’gpu_hist’ (equivalent to the XGBoost fast histogram algorithm; much faster and uses considerably less memory) and grow_policy=’lossguide’ (split at the nodes with the highest loss change).
  • Best hyper-parameters learning_rate=0.0483, max_leaves=47, min_child_weight=100 after hyper-parameter tuning.
  • For best hyper-parameters RMSE score = 1.8507

10.5. CatBoost Regressor

  • CatBoost is based on gradient boosting. It implements symmetric trees, which helps decrease prediction time, something that is extremely important for low-latency environments.
  • Here we made a few changes: grow_policy=’Lossguide’, logging_level=”Silent”.
  • Lossguide : the tree is built leaf by leaf until the specified maximum number of leaves is reached; on each iteration the non-terminal leaf with the best loss improvement is split.
  • Silent : no output to stdout (except for important warnings).
  • Best hyper-parameters learning_rate=0.0207, num_leaves=94, depth=7 after hyper-parameter tuning.
  • For best hyper-parameters RMSE score = 1.8349

10.6. LightGBM Regressor

  • LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
  • It is prefixed ‘Light’ because of its high speed: LightGBM can handle large data sets and uses less memory to run.
  • Best hyper-parameters learning_rate=0.034, num_leaves=224, max_depth=66 after hyper-parameter tuning.
  • For best hyper-parameters RMSE score = 1.8299

11. Models Comparison and Kaggle Score

  • The moving average model performs the worst, with RMSE = 3.2802 (highest), while the LightGBM model performs best among all models, with RMSE = 1.8299 (lowest).
  • The linear regression model performs worse than every non-linear model.
  • Overall, the boosting models perform better than the decision tree and random forest models.
  • LightGBM gives the best Kaggle score of 0.67749 (private score). With 5,558 participants in total, my rank of 458 places me in the top 9% of the private leaderboard.

12. Final Pipeline

  • Randomly select a single data point for computation.
  • Apply feature engineering and data preprocessing to the selected data point.
  • Train the LightGBM model (the best model) on the data.
  • Finally, predict the sales for days 1942 to 1969.

13. Deployment

  • After training LightGBM (the best model) with the best hyper-parameters, we stored the model in a pickle file and deployed it on my local system, along with a Flask API built around the final pipeline that takes PRODUCT_ID and STORE_ID as input and returns the forecasted sales in tabular format.
  • Designed an HTML page with two inputs, PRODUCT_ID and STORE_ID, that shows the forecasted sales for the next 28 days (i.e. 23 May 2016 to 19 June 2016) in tabular format.
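
The persistence step is plain pickle. A self-contained round-trip sketch (a dict stands in for the fitted LightGBM booster, and an in-memory buffer stands in for the model.pkl file):

```python
import io
import pickle

# Stand-in for the trained model object that the Flask app would load.
model = {"name": "lightgbm", "forecast_horizon": 28}

buffer = io.BytesIO()           # stands in for open("model.pkl", "wb")
pickle.dump(model, buffer)      # serialize once at deployment time

buffer.seek(0)                  # stands in for open("model.pkl", "rb")
restored = pickle.load(buffer)  # what the API does at startup

print(restored == model)  # True
```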

Video explaining the working of Deployed Model :

14. Future Work

One can try the following models:

  • Stacked ensemble model (uses predictions from multiple models to build a new model which is further used for making predictions on the test set).
  • Neural Network (because of loss functions flexibility).
  • LSTM (Long Short-Term Memory).

15. References

  1. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163684
  2. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/163216
  3. https://www.analyticsvidhya.com/blog/2018/02/time-series-forecasting-methods/
  4. https://www.kaggle.com/tarunpaparaju/m5-competition-eda-models
  5. https://mofc.unic.ac.cy/m5-competition/

The complete project is available on Github. For any queries regarding the project, contact me on Linkedin.
