This chapter describes the organization and the environment in which the project was carried out, gives an overview of the project undertaken, and discusses the tools and technologies used in the development of the system.
1.1 ORGANIZATION PROFILE
Noodle Analytics, Inc. develops enterprise artificial intelligence solutions for collaboration among business executives, process experts, and artificial intelligence technologies.
Noodle Analytics, Inc. was founded in 2016 and is headquartered in San Francisco, California.
The core of Noodle.ai is advanced mathematical modelling. Powered by very high dimensional non-linear regressions, classifications, neural networks, and association, Noodle.ai is able to find the optimal operating plan among billions of possibilities.
Learning algorithms are distinct from static rules-based software. They learn and improve over time as they process more data. Getting the right data sources is key to enterprise-class AI. Noodle.ai augments your internal company data with the Noodle.ai data cartridges—curated external data streams.
Enter the BEAST, the Noodle.ai AI-optimized high-performance computing platform. Crunching the volume and types of data used to train enterprise algorithms was impossible even a few years ago. The BEAST has petaflops of computing power to enable rapid training and validation of a wide breadth of data science hypotheses.
AI systems are powerful only if applied to the right business problems. Noodle.ai believes operations planning & execution is the highest value area to which these techniques may be applied.
Typically, the company provides its solutions to solve complex business challenges and drive improvements in the areas of business process optimization, artificial intelligence technologies, human-centred design, east-west collaboration, and global agile methods, as well as customer, product, and enterprise operations.
It also offers advisory services in the areas of enterprise pathways, data value assessment, internal skills development, and architecture.
1.1.1 Team Profile
The team has a good knowledge of machine learning algorithms, develops predictive models, understands the theory (the mathematics and statistics) behind the models, and can interpret and explain model behaviour in jargon-free language.
The team is well versed in programming languages such as Python and R for predictive data analysis.
1.2 NOODLE NOTEBOOKS
At Noodle, solving any data science problem requires a reasonable amount of groundwork in data analysis, feature engineering, data wrangling, modelling and signal detection. Teams spend considerable time on this groundwork, which prolongs project completion.
The main objective of this project is to develop generic solutions for this preliminary work so that data scientists can concentrate more on core solution development.
1.3 SYSTEM CONFIGURATION
The system environment used for the development of the project is described.
1.3.1 Hardware Specification
The hardware environment in which the project is carried out has been detailed.
Processor : Intel® Core™ i5-7300U CPU @ 2.60GHz 2.70 GHz
Hard Disk : 1 TB
RAM : 8 GB
1.3.2 Software Specification
The software environment (software tools, languages and the operating system) in which the project was carried out is detailed below.
Operating System : Windows 10
Tools : GitHub, Anaconda, JupyterHub, PyCharm
IDE : PyCharm Community Edition 2018.2.4
Language : Python
1.4 TOOLS AND TECHNOLOGIES USED
The tools and technologies used for the development of the project are described.
Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991. An interpreted language, Python has a design philosophy that emphasizes code readability (notably using whitespace indentation to delimit code blocks rather than curly brackets or keywords), and syntax that allows programmers to express concepts in fewer lines of code than might be used in languages such as C++ or Java. The language provides constructs intended to enable writing clear programs on both a small and large scale.
Python features a dynamic type system and automatic memory management and supports multiple programming paradigms, including object-oriented, imperative, functional, and procedural styles. It has a large and comprehensive standard library.
Python interpreters are available for many operating systems, allowing Python code to run on a wide variety of systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.
Many programmers nowadays opt for Python to build software applications with a concise, clean, and readable code base. They can also accelerate custom software application development by taking advantage of a number of integrated development environments (IDEs) for Python. PyCharm is one of the most widely used IDEs for the Python programming language. At present, PyCharm is used by large enterprises such as Twitter, Pinterest, HP, Symantec and Groupon.
JetBrains has developed PyCharm as a cross-platform IDE for Python. In addition to supporting versions 2.x and 3.x of Python, PyCharm is also compatible with Windows, Linux, and macOS. At the same time, the tools and features provided by PyCharm help programmers to write a variety of software applications in Python quickly and efficiently. The developers can even customize the PyCharm UI according to their specific needs and preferences.
Also, they can extend the IDE by choosing from over 50 plug-ins to meet complex project requirements.
Data is being generated at an ever-increasing rate, and analysing it correctly and in time is very valuable. Apache Spark is one of the most widely used frameworks for handling big data in real time and performing analyses, and Python is among the most popular programming languages for complex data analysis and data munging tasks.
Apache Spark is a fast cluster computing framework which is used for processing, querying and analysing big data.
Being based on in-memory computation, it has an advantage over several other big data frameworks.
Spark was originally written in the Scala programming language, and the open source community has developed a tool, PySpark, to support Python on Apache Spark. PySpark lets data scientists work with RDDs in Apache Spark from Python through the Py4J library. Several features make PySpark an attractive framework:
• Speed: Spark is reported to run up to 100x faster than traditional large-scale data processing frameworks for in-memory workloads.
• Powerful Caching: Simple programming layer provides powerful caching and disk persistence capabilities.
• Deployment: Can be deployed through Mesos, Hadoop via Yarn, or Spark’s own cluster manager.
• Real Time: Real-time computation and low latency because of in-memory computation.
• Polyglot: Supports programming in Scala, Java, Python, and R.
1.4.4 Jupyter Notebook
Notebook documents (or “notebooks”, all lower case) are documents produced by the Jupyter Notebook App, which contain both computer code (e.g. Python) and rich text elements (paragraphs, equations, figures, links, etc.). Notebook documents are both human-readable documents containing the analysis description and the results (figures, tables, etc.) and executable documents which can be run to perform data analysis.
The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser. The Jupyter Notebook App can be executed on a local desktop requiring no internet access or can be installed on a remote server and accessed through the internet.
In addition to displaying, editing and running notebook documents, the Jupyter Notebook App has a “Dashboard” (Notebook Dashboard), a “control panel” that shows local files and allows the user to open notebook documents or shut down their kernels.
A notebook kernel is a “computational engine” that executes the code contained in a notebook document. The IPython kernel executes Python code. Kernels for many other languages exist.
When a notebook document is opened, the associated kernel is automatically launched. When the notebook is executed (either cell-by-cell or with the menu Cell -> Run All), the kernel performs the computation and produces the results. Depending on the type of computations, the kernel may consume significant CPU and RAM.
NOODLE NOTEBOOKS OVERVIEW
Noodle notebooks provide facilities to generate a complete data science project workflow.
A data science workflow provides a lifecycle to structure the development of the data science projects. The lifecycle outlines the steps, from start to finish, that projects usually follow when they are executed.
2.1 PROBLEM DEFINITION
The main objective of this project is to develop generic solutions for data analysis, feature engineering, data wrangling, modelling and signal detection which enables data scientists to concentrate more on core solution development.
The Noodle notebooks are needed to overcome the following obstacles encountered when solving data science problems:
• Lack of consistency – Different platforms, languages and dependency issues.
• Redundancy of code across many POD teams.
• Lack of reusability – two teams were unable to use the same code for the same purpose.
• Low abstraction – A data scientist needed more knowledge of various tools.
• More time consumed in solving a POD project
• No horizontal reusability of modules.
• No structured documentation available.
2.3 DATA SCIENCE WORKFLOW PHASES
The workflow has four main phases: preparation of the data, alternating between running the analysis and reflection to interpret the outputs, and finally dissemination of results in the form of written reports and/or executable code.
2.3.1 Exploratory Data Analysis
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
• maximize insight into a data set
• uncover underlying structure
• extract important variables
• detect outliers and anomalies
• test underlying assumptions
• develop parsimonious models
• determine optimal factor settings
The EDA approach is precisely that–an approach–not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.
EDA is not identical to statistical graphics although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques–all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue; EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow with the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set; what we look for; how we look; and how we interpret. It is true that EDA heavily uses the collection of techniques that we call “statistical graphics”, but it is not identical to statistical graphics per se.
Most EDA techniques are graphical in nature with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore, and graphics gives the analysts unparalleled power to do so, enticing the data to reveal its structural secrets, and being always ready to gain some new, often unsuspected, insight into the data. In combination with the natural pattern-recognition capabilities that we all possess, graphics provides, of course, unparalleled power to carry this out.
The graphical techniques employed in EDA are often quite simple, consisting of various techniques of:
• Plotting the raw data (such as data traces, histograms, probability plots, lag plots, block plots, and Youden plots).
• Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.
• Positioning such plots to maximize our natural pattern-recognition abilities, such as using multiple plots per page.
2.3.2 Data Wrangling
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.
This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, “munging” the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.
The data transformations are typically applied to distinct entities (e.g. fields, rows, columns, data values etc.) within a data set, and could include such actions as extractions, parsing, joining, standardizing, augmenting, cleansing, consolidating and filtering to create desired wrangling outputs that can be leveraged downstream.
The recipients could be individuals, such as data architects or data scientists who will investigate the data further, business users who will consume the data directly in reports, or systems that will further process the data and write it into targets such as data warehouses, data lakes or downstream applications.
In 2011, researchers from Stanford University and UC Berkeley published a paper entitled Wrangler: Interactive Visual Specification of Data Transformation Scripts. In it, the authors described a research project called Wrangler, which was “an interactive system for creating data transformations.” Wrangler introduced a new way to perform data wrangling through direct interaction with data presented in a visual interface. Analysts could interactively explore, change and manipulate the data and immediately see results. Wrangler tracked the user’s data transformations and could then automatically generate code or scripts that could be applied repeatedly on other datasets.
In 2012, several of the authors (Kandel, Hellerstein, Heer) went on to found Trifacta, which is a commercialization of the software in the Wrangler project. Since then, a number of other companies have developed products to perform data wrangling. These include both commercial and freely available offerings.
There are typically six iterative steps that make up the data wrangling process.
1. Discovering: Before you can dive deeply, you must better understand what is in your data, which will inform how you want to analyse it. How you wrangle customer data, for example, may be informed by where they are located, what they bought, or what promotions they received.
2. Structuring: This means organizing the data, which is necessary because raw data comes in many different shapes and sizes. A single column may turn into several rows for easier analysis. One column may become two. Movement of data is made for easier computation and analysis.
3. Cleaning: What happens when errors and outliers skew your data? You clean the data. What happens when state data is entered as CA or California or Calif.? You clean the data. Null values are changed, and standard formatting implemented, ultimately increasing data quality.
4. Enriching: Here you take stock of your data and strategize about how additional data might augment it. Questions asked during this data wrangling step might be: what new types of data can I derive from what I already have, or what other information would better inform my decision making about this current data?
5. Validating: Validation rules are repetitive programming sequences that verify data consistency, quality, and security. Examples of validation include ensuring uniform distribution of attributes that should be distributed normally (e.g. birth dates) or confirming accuracy of fields through a check across data.
6. Publishing: Analysts prepare the wrangled data for use downstream – whether by a user or software – and document any particular steps taken or logic used to wrangle said data. Data wrangling gurus understand that implementation of insights relies upon the ease with which it can be accessed and utilized by others.
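The cleaning step described above (the CA / California / Calif. example) can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: the `STATE_ALIASES` table and the `clean_state` helper are hypothetical names introduced here, and mapping unknown values to `None` is an assumed policy.

```python
# Hypothetical lookup table; a real project would maintain a fuller mapping.
STATE_ALIASES = {"ca": "CA", "calif": "CA", "california": "CA"}


def clean_state(value):
    """Return a canonical state code, or None when missing or unrecognised."""
    if value is None:
        return None
    # Normalise whitespace, a trailing full stop and letter case.
    key = value.strip().rstrip(".").lower()
    if not key:
        return None
    return STATE_ALIASES.get(key)  # unknown spellings become None (assumed policy)


raw = ["CA", "California", "Calif.", " calif ", "", None]
cleaned = [clean_state(v) for v in raw]
print(cleaned)
```

All four spellings collapse to one code, while empty and missing entries are marked explicitly so that a later imputation step can handle them.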
2.3.3 Feature Engineering
A feature is an attribute or property shared by all the independent units on which analysis or prediction is to be done. Any attribute could be a feature, if it is useful to the model.
The purpose of a feature, other than being an attribute, would be much easier to understand in the context of a problem. A feature is a characteristic that might help when solving the problem.
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.
The features in your data are important to the predictive models you use and will influence the results you are going to achieve. The quality and quantity of the features will have great influence on whether the model is good or not.
The process of feature engineering:
• Brainstorming or Testing features
• Deciding what features to create
• Creating features
• Checking how the features work with your model
• Improving your features if needed
• Going back to brainstorming/creating more features until the work is done
Automation of feature engineering has become an emerging topic of research in academia. In 2015, researchers at MIT presented the Deep Feature Synthesis algorithm and demonstrated its effectiveness in online data science competitions where it beat 615 of 906 human teams. Deep Feature Synthesis is available as an open source library called Featuretools. That work was followed by other researchers including IBM’s OneBM and Berkeley’s ExploreKit. The researchers at IBM state that feature engineering automation “helps data scientists reduce data exploration time allowing them to try and error many ideas in short time. On the other hand, it enables non-experts, who are not familiar with data science, to quickly extract value from their data with a little effort, time and cost.”
2.3.4 Time Series Modelling
Time series modelling is a dynamic research area that has attracted the attention of the research community over the last few decades. The main aim of time series modelling is to carefully collect and rigorously study the past observations of a time series in order to develop an appropriate model that describes the inherent structure of the series. This model is then used to generate future values for the series, i.e. to make forecasts. Time series forecasting can thus be described as the act of predicting the future by understanding the past. Because time series forecasting is important in many practical fields such as business, economics, finance, science and engineering, proper care should be taken to fit an adequate model to the underlying time series. Successful time series forecasting depends on appropriate model fitting. Researchers have made considerable efforts over many years to develop efficient models that improve forecasting accuracy. As a result, various important time series forecasting models have evolved in the literature.
A time series in general is supposed to be affected by four main components, which can be separated from the observed data. These components are: Trend, Cyclical, Seasonal and Irregular components.
The general tendency of a time series to increase, decrease or stagnate over a long period of time is termed as Secular Trend or simply Trend. Thus, it can be said that trend is a long-term movement in a time series. Seasonal variations in a time series are fluctuations within a year during the season. The cyclical variation in a time series describes the medium-term changes in the series, caused by circumstances, which repeat in cycles. Irregular or random variations in a time series are caused by unpredictable influences, which are not regular and do not repeat in a pattern.
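As an illustration of how these components can be separated, the following sketch applies a classical additive decomposition in plain Python. It is a minimal sketch under stated assumptions: the seasonal period is known, the model is additive, and the trend and cyclical components are estimated together by a centred moving average. The function names and the synthetic series are hypothetical.

```python
def centred_moving_average(series, period):
    """Estimate the trend-cycle; entries are None where the window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half:t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the two end points get half weight (a 2 x m average).
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def seasonal_indices(series, trend, period):
    """Average the detrended values per season and centre them on zero."""
    buckets = [[] for _ in range(period)]
    for t, (y, m) in enumerate(zip(series, trend)):
        if m is not None:
            buckets[t % period].append(y - m)
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period
    return [m - offset for m in means]


# Synthetic quarterly series: linear trend plus a repeating seasonal pattern.
pattern = [2, -1, -2, 1]
series = [t + pattern[t % 4] for t in range(12)]

trend = centred_moving_average(series, period=4)
seasonal = seasonal_indices(series, trend, period=4)
# Whatever is left after removing trend and seasonality is the irregular part.
irregular = [None if m is None else y - m - seasonal[t % 4]
             for t, (y, m) in enumerate(zip(series, trend))]
print(trend, seasonal, irregular, sep="\n")
```

On this noise-free series the sketch recovers the linear trend and the seasonal pattern, and the irregular component is zero wherever the trend is defined.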
2.3.5 Signal Detection
In statistics and signal processing, signal detection is the process of finding abrupt changes in the mean level of a time series or signal.
In business applications, project managers should know whether an outlier represents an error, or whether there are specific reasons to be concerned about it (if undesired) or excited about it. In research and statistical modelling projects, outliers affect model performance, so they are often removed during model fitting to improve prediction accuracy.
By definition, an outlier is an observation that differs significantly from other observations of the same feature. If a time series is plotted, outliers are usually the unexpected spikes or dips of observations at given points in time. A temporal dataset with outliers has several characteristics:
• There is a systematic pattern (which is deterministic) and some variation (which is stochastic)
• Only a few data points are outliers
• Outliers are significantly different from the rest of the data
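The characteristics above suggest a simple robust test: measure how far each point lies from the median, scaled by the median absolute deviation (MAD). The sketch below uses the modified z-score with the commonly quoted constant 0.6745 and cut-off 3.5. It assumes the systematic pattern has already been removed or is negligible, and the function name and data are hypothetical.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


series = [10, 11, 9, 10, 12, 10, 48, 11, 9, 10]
print(mad_outliers(series))
```

Because the median and the MAD are barely affected by a few extreme values, the spike at position 6 is flagged without distorting the scores of the remaining points.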
2.3.6 Performance Evaluation
Evaluating the performance of a model is a fundamental aspect of machine learning. The evaluation method is the yardstick for examining the efficiency and performance of any model. Evaluation is important for understanding the quality of the model or technique, for refining the parameters in the iterative process of learning, and for selecting the most acceptable model or technique from a given set of models or techniques. There are several criteria for evaluating models for different tasks, and other criteria can be important as well, such as computational complexity or the comprehensibility of the model.
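For forecasting tasks, three widely used error criteria are the mean absolute error (MAE), the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). The sketch below is illustrative only; the source does not state which metrics the project uses, so the choice of metrics and the function names are assumptions.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalises large errors more than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error; undefined when an actual value is zero."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined for zero actual values")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 380]
print(mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```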
NOODLE NOTEBOOKS DESIGN AND IMPLEMENTATION
Noodle notebooks consist of Python notebooks (API-like notebooks) and Python modules. Each notebook is designed around a few standard principles: data science modelling of time series data, horizontal reusability of the notebooks and libraries, and up-to-date documentation for every module that explains the working of each method to the user.
Pynotebooks are Python notebooks that consist of modules to perform data analysis, data wrangling, feature engineering, modelling and signal detection on any dataset.
3.1.1 Data Analysis
Data analysis module contains the process of evaluating data using analytical and logical reasoning to examine each component of the data provided.
3.1.1.1 Univariate Analysis
Univariate analysis is performed on data that contains only one independent variable. Different kinds of plots are available to summarize the data. Histograms are used to estimate the distribution of continuous variables. Distribution plots are suitable for comparing the distribution of given data with their expected values. Figure 3.1 depicts an example of a histogram plot.
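The binning behind a histogram can be sketched without any plotting library. The following is a minimal illustration with equal-width bins; the `histogram` helper and the sample values are hypothetical and do not reproduce the notebook's plotting code.

```python
def histogram(values, bins):
    """Count values in equal-width bins; the last bin includes the maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; equal-width bins are undefined")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(values, bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f}) {'#' * count}")
```

The text bars give a rough picture of the distribution: most values fall in the second bin and a single value sits far to the right.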
3.1.1.2 Bivariate Analysis
Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.
Bivariate analysis can be helpful in testing simple hypotheses of association. Bivariate analysis can help determine to what extent it becomes easier to know and predict a value for one variable (possibly a dependent variable) if we know the value of the other variable (possibly the independent variable). Figure 3.2 is an example for bivariate analysis.
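One common measure of the empirical relationship between two variables is the Pearson correlation coefficient. The sketch below computes it directly from its definition; it is an illustration with hypothetical data, and the source does not state which association measure the notebook reports.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(pearson(x, y), pearson(x, list(reversed(y))))
```

A value close to +1 or -1 indicates that one variable is almost perfectly predictable from the other by a straight line, while a value near 0 indicates no linear association.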
3.1.1.3 Multivariate Analysis
Multivariate analysis is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time. Figure 3.3 gives a correlation network for multivariate analysis.
3.1.2 Feature Engineering
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning and is both difficult and expensive.
This notebook imports a user-defined Python package for feature engineering, which implements various methods such as log transformation, tanh transformation, lag shifting, groupby, aggregate, Box-Cox transformation, wavelet transformation, clustering, concatenation, conversion of categorical variables to one-hot encoding, normalization and dimensionality reduction. Figure 3.4 shows how a method call is made to perform feature engineering.
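Three of the methods named above (log transformation, lag shifting and one-hot encoding) can be illustrated in plain Python. The function names below are hypothetical and are not the signatures of the project's package; they only show what each transformation does to the data.

```python
import math


def log_transform(values):
    """log(1 + x), which compresses large values and is defined at zero."""
    return [math.log1p(v) for v in values]


def lag(values, k):
    """Shift a series k steps later; the first k slots have no history."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return ([None] * k + list(values))[:len(values)]


def one_hot(categories):
    """Turn each category into a dict of 0/1 indicator columns."""
    levels = sorted(set(categories))
    return [{f"is_{level}": int(c == level) for level in levels} for c in categories]


sales = [5, 7, 9, 11]
print(lag(sales, 1))
print(log_transform([0, 9, 99]))
print(one_hot(["a", "b", "a"]))
```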
3.1.3 Data Wrangling
Data wrangling module contains data cleaning and data wrangling components. Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
In data cleaning, methods such as imputation and scaling are implemented. In data wrangling, methods such as append, merging, stacking, regex filtering, intersection, union and unstacking are implemented.
The modelling module implements several clustering techniques such as agglomerative clustering, Gaussian mixture clustering and K-Means clustering.
Agglomerative hierarchical clustering is a bottom-up clustering method where clusters have sub-clusters, which in turn have sub-clusters, etc. In Gaussian mixture models, the component distributions are Gaussians. K-Means clustering partitions n objects into k clusters in which each object belongs to the cluster with the nearest mean. An example of optimal Gaussian clustering is shown in Figure 3.5.
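The K-Means idea of assigning each object to the cluster with the nearest mean can be sketched with Lloyd's algorithm in plain Python. This is a minimal sketch, not the module's implementation: initialising the centroids with the first k points is a naive assumption made for determinism, and the function names and sample points are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])))


def kmeans(points, k, iterations=20):
    centroids = [points[i] for i in range(k)]  # naive initialisation (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

In practice the result depends on the initial centroids, which is why library implementations typically restart from several initialisations.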
3.1.5 Signal Detection
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behaviour, called outliers. It has many applications in business, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.
Generally, anomalies can be classified into three categories:
1. Point anomalies
2. Contextual anomalies
3. Collective anomalies
Anomaly detection for time-series data is considered, which falls under Contextual anomalies. Business use case: Spending $100 on food every day during the holiday season is normal but may be odd otherwise.
Anomaly detection is done on time series data by converting each time series into a smaller latent vector and measuring the reconstruction loss between the reconstructed output and the input. The higher the reconstruction loss, the more likely the input is an anomaly. The output of anomaly detection is shown in Figure 3.6.
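The scoring logic can be illustrated with a deliberately trivial stand-in for the learned model. In the sketch below the "encoder" compresses each window to a single latent number (its mean) and the "decoder" rebuilds the window as a constant; a real system would use a trained model such as an autoencoder instead. All names and data are hypothetical, and only the reconstruction-loss scoring reflects the description above.

```python
def encode(window):
    """Toy encoder: compress a window to one latent number (its mean)."""
    return sum(window) / len(window)


def decode(latent, size):
    """Toy decoder: rebuild the window as a constant at the latent level."""
    return [latent] * size


def reconstruction_loss(window):
    """Mean squared error between a window and its reconstruction."""
    rebuilt = decode(encode(window), len(window))
    return sum((a - b) ** 2 for a, b in zip(window, rebuilt)) / len(window)


def score_windows(series, size):
    """Score consecutive non-overlapping windows of the series."""
    return [reconstruction_loss(series[i:i + size])
            for i in range(0, len(series) - size + 1, size)]


series = [5, 5, 6, 5, 5, 6, 5, 5, 5, 30, 5, 6, 6, 5, 5, 5]
scores = score_windows(series, size=4)
print(scores)
```

The window containing the spike cannot be rebuilt well from its compressed form, so its loss is far higher than that of the other windows and it is the most likely anomaly.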
3.2 DEMAND FORECASTING APPLICATION – SALESAI
AI is destined to be the perfect tool to fuel organizations’ sales efforts and power sales teams with genuinely intelligent tools to more effectively organize their work and sell more.
Technology is fully embedded in today’s sales organizations and it goes beyond just a database of customer or prospect information. SalesAI contains submodules that work specifically on supply and demand data. The modules are built so that they can handle many time series within the same dataset individually.
3.2.1 Data Analysis
The data analysis modules hold methods for time series analysis. The tech can be used to analyse a business’ data and to provide actionable information to help those at the top make better decisions.
3.2.1.1 Summary Statistics
Summary statistics give a quick description of the data. For time series data, summary statistics are an important way to get a basic idea of the dataset.
3.2.1.2 Trend Analysis
Trend analysis is the review of historical results to detect patterns. This module plots box plots for the complete data and for multiple time series, depending on the input given. The graph is saved in the user-defined location with a file name that contains the unique trend. Figure 3.7 depicts an example of trend analysis.
3.2.1.3 Coverage Exploration
The Coverage Exploration package allows the user to get the total revenue/unit sales for different product groups and their percentage revenue/unit sales distribution across the data.
Subsetting the data by total revenue/unit sales, or by percentage distribution across the data, gives the user a better understanding of the data and an insight into the performance of multiple time series across the dataset. Knowing the best performing time series can help to create an appropriate flow to handle the data.
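The totals and percentage shares described above can be sketched as follows. The `coverage` helper, the group names and the revenue figures are hypothetical and serve only to illustrate the calculation.

```python
from collections import defaultdict


def coverage(rows):
    """Total revenue per product group and its share of the overall revenue."""
    totals = defaultdict(float)
    for group, revenue in rows:
        totals[group] += revenue
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("total revenue is zero; shares are undefined")
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(group, total, 100.0 * total / grand_total) for group, total in ranked]


rows = [("A", 600.0), ("B", 300.0), ("A", 400.0), ("C", 400.0), ("B", 300.0)]
report = coverage(rows)
for group, total, share in report:
    print(f"{group}: {total:8.1f} {share:5.1f}%")
```

Sorting by total puts the best performing groups first, which makes it easy to pick the subset of series that covers a chosen share of revenue.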
3.2.1.4 Periodicity Analysis
A cyclic pattern exists when data exhibit rises and falls that are not of fixed period. A seasonal pattern exists when the data is influenced by seasonal factors. Seasonality is always of a fixed and known period. Hence, seasonal time series are sometimes called periodic time series.
The periodicity of time series data is analysed using the power spectrum and multiple boxplots to extract meaningful statistical components from the data. Figure 3.8 shows the power spectrum of the data, from which the seasonality is identified.
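How a seasonal period can be read off a power spectrum is sketched below with a direct discrete Fourier transform in plain Python. This is an illustration under assumptions: the series is evenly sampled, the dominant spectral peak corresponds to the seasonality, and the function names and the synthetic series are hypothetical. A production implementation would normally use a fast Fourier transform from a numerical library.

```python
import cmath
import math


def power_spectrum(series):
    """Periodogram of the mean-removed series for frequencies k = 1 .. n/2."""
    n = len(series)
    mean = sum(series) / n
    centred = [v - mean for v in series]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centred))
        spectrum.append((k, abs(coeff) ** 2 / n))
    return spectrum


def dominant_period(series):
    """Period (in samples) of the frequency with the highest power."""
    k, _ = max(power_spectrum(series), key=lambda item: item[1])
    return len(series) / k


# Synthetic monthly series with a yearly cycle over four years.
series = [math.sin(2 * math.pi * t / 12) for t in range(48)]
print(dominant_period(series))
```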
3.2.2 Data Wrangling
Often ‘raw’ data can be hard, even impossible, to analyse and gain useful insights from. This is where somebody will transform the data entries, fields, rows and columns into a more useful format.
This module prepares the data for a dedicated purpose: it takes the data from its raw state and transforms and maps it into another format (for example, by imputing missing dates), normally for use beyond its original intent.
Imputation refers to replacing missing values in the dataset often in the form of NaN with Mean, Median etc. Presence of missing values in the dataset often leads to false predictions. Multiple Time Series Imputation refers to applying imputation techniques across multiple time series simultaneously.
The notebook allows the user to identify the dominant frequency of an individual time series or of the whole dataset, to detect missing values in the date/time column and in the whole dataset, and to impute them.
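Filling missing dates and imputing missing values can be sketched for a single series as follows. The sketch assumes a daily frequency and mean imputation; the `impute_daily` helper and the sample observations are hypothetical.

```python
from datetime import date, timedelta


def impute_daily(observations):
    """Fill missing days and missing values with the mean of the known values.

    observations maps a date to a value, where None marks a missing value.
    """
    known = [v for v in observations.values() if v is not None]
    fill = sum(known) / len(known)
    day, last = min(observations), max(observations)
    filled = []
    while day <= last:
        value = observations.get(day)  # None for a missing date or a missing value
        filled.append((day, fill if value is None else value))
        day += timedelta(days=1)
    return filled


observations = {
    date(2019, 1, 1): 10.0,
    date(2019, 1, 2): None,   # value missing
    date(2019, 1, 4): 14.0,   # 3 January is absent altogether
    date(2019, 1, 5): 12.0,
}
filled = impute_daily(observations)
print(filled)
```

Applying the same function to every series in a dataset, one series at a time, gives the multiple time series imputation described above.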
3.2.3 Feature Engineering
For each time series, various features must be mined, including lag features, date-related features and other business-related features.
3.2.3.1 Time Series Transformations
The purpose of transformations is to simplify the patterns in the historical data by removing known sources of variation or by making the pattern more consistent across the whole data set. Transformations of historical data can often lead to a simpler forecasting task.
Only one of the transformations listed in the table below can be applied at a time. The transformation name and the arguments pertaining to the model are initialized as input parameters in the notebook.
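As an illustration of a transformation and its inverse, the sketch below implements the Box-Cox transformation, which reduces to the log transformation when the parameter lambda is zero. Box-Cox is chosen here only as an example; the table of transformations available in the notebook is not reproduced in this text, so the choice and the function names are assumptions.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def inverse_box_cox(values, lam):
    """Map transformed values (for example forecasts) back to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in values]
    return [(lam * v + 1) ** (1 / lam) for v in values]


history = [1.0, 4.0, 9.0, 16.0]
transformed = box_cox(history, lam=0.5)
restored = inverse_box_cox(transformed, lam=0.5)
print(transformed, restored)
```

Forecasts are produced on the transformed scale and must be passed through the inverse before they are reported.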
3.2.3.2 Time Series Classification
Each time series is classified as smooth, intermittent, lumpy or erratic based on the CV² and p values calculated for it (the squared coefficient of variation of demand and the average inter-demand interval), and a histogram of the number of time series in each category is plotted. An example distribution of time series data for each class is shown in Figure 3.9.
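A minimal sketch of such a classification is given below. It assumes the commonly used cut-off values of 1.32 for the average inter-demand interval and 0.49 for CV², which the source does not state, and it approximates the interval as the number of periods divided by the number of periods with demand. The function name and the sample series are hypothetical.

```python
import statistics
from collections import Counter


def classify_demand(series, p_cut=1.32, cv2_cut=0.49):
    """Classify a demand series as smooth, erratic, intermittent or lumpy."""
    nonzero = [v for v in series if v > 0]
    if not nonzero:
        raise ValueError("series has no demand")
    p = len(series) / len(nonzero)  # average inter-demand interval (approximation)
    cv2 = (statistics.pstdev(nonzero) / statistics.mean(nonzero)) ** 2
    if p < p_cut:
        return "smooth" if cv2 < cv2_cut else "erratic"
    return "intermittent" if cv2 < cv2_cut else "lumpy"


dataset = {
    "sku_1": [10, 11, 9, 10, 10, 12],
    "sku_2": [0, 5, 0, 0, 5, 0, 5, 0],
    "sku_3": [1, 20, 2, 30, 1, 25],
    "sku_4": [0, 1, 0, 0, 30, 0, 2, 0, 0, 25],
}
classes = {name: classify_demand(values) for name, values in dataset.items()}
print(classes)
print(Counter(classes.values()))  # counts per class, i.e. the data behind the histogram
```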
3.2.3.3 Generating Calendar Features
Date-time features are attributes extracted from the date column to give a better understanding of the input date/time column. These attributes provide a detailed view of the input date column.
This module allows the user to generate either the full default set of date-time features, by passing an empty list, or a custom subset, by passing a list of the desired features.
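The empty-list-versus-custom-list behaviour can be sketched as follows. The feature names and the function `calendar_features` are hypothetical; the actual set of features offered by the module may differ.

```python
from datetime import date

# Hypothetical catalogue of supported date-time features.
FEATURE_FUNCS = {
    "year": lambda d: d.year,
    "month": lambda d: d.month,
    "day": lambda d: d.day,
    "day_of_week": lambda d: d.weekday(),  # Monday = 0, Sunday = 6
    "week_of_year": lambda d: d.isocalendar()[1],
    "quarter": lambda d: (d.month - 1) // 3 + 1,
    "is_weekend": lambda d: d.weekday() >= 5,
}


def calendar_features(dates, features=None):
    """Return one feature dictionary per date.

    An empty list (or None) selects every supported feature.
    """
    selected = list(features) if features else list(FEATURE_FUNCS)
    unknown = [name for name in selected if name not in FEATURE_FUNCS]
    if unknown:
        raise ValueError(f"unsupported features: {unknown}")
    return [{name: FEATURE_FUNCS[name](d) for name in selected} for d in dates]


all_features = calendar_features([date(2021, 3, 6)], [])
some_features = calendar_features([date(2021, 3, 6)], ["quarter", "is_weekend"])
```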
Standalone components are built for each time series forecasting technique. All the scenarios that might occur due to the nature of a time series have to be handled inside the model routines.
In its simplest form, time series forecasting predicts the observation at the next time step. This is known as a one-step forecast. When multiple time steps need to be predicted, multi-step time series forecasting is required.
Forecasting involves taking models which are fit on historical data and using those models to predict future observations.
Baseline Forecast
A baseline forecast is an estimate of future demand that is based on historical demand. It is important because it elevates your forecast above the status of a guess. When you use a baseline, you recognize that your best guide to what happens next is often what happened before. Baselines allow us to define success as doing better than the baseline.
Properties of a technique for making a baseline forecast are:
• Simple: A method that requires little or no training or intelligence.
• Fast: A method that is fast to implement and computationally trivial to make a prediction.
• Repeatable: A method that is deterministic, meaning that it produces the same output given the same input.
Various simple time series forecasting models are implemented; they are listed in Figure 3.10.
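Three common baseline methods can be sketched in a few lines. They are assumed examples of simple, fast and repeatable models and are not necessarily the ones listed in Figure 3.10; the function names are hypothetical.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every future step."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the last observed season cyclically."""
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def mean_forecast(history, horizon):
    """Predict the historical mean for every future step."""
    return [sum(history) / len(history)] * horizon


history = [1, 2, 3, 4, 5, 6]
naive = naive_forecast(history, 2)
seasonal = seasonal_naive_forecast(history, 4, season_length=3)
average = mean_forecast(history, 2)
```

Any more elaborate model can then be judged by whether its error is lower than that of these baselines on the same data.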
Forecast Ensemble
The forecast ensemble module can perform forecasting using models such as ARIMA, STL decomposition, Prophet and simple exponential smoothing. The forecasts from these models are combined into the final forecast. The ensemble can contain any number of models, and each model is given a weight which is a function of its validation error.
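The weighting idea can be illustrated with a minimal sketch. The report does not specify the weighting function, so the sketch assumes normalized inverse validation error; the function names and the example numbers are hypothetical, and the sketch does not account for correlations between the models.

```python
def inverse_error_weights(validation_errors):
    """Weight each model by the inverse of its validation error (assumed rule)."""
    if any(err <= 0 for err in validation_errors.values()):
        raise ValueError("validation errors must be strictly positive")
    inverse = {name: 1.0 / err for name, err in validation_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine_forecasts(forecasts, weights):
    """Return the weighted sum of the model forecasts at every horizon step."""
    horizon = len(next(iter(forecasts.values())))
    return [
        sum(weights[name] * forecasts[name][step] for name in forecasts)
        for step in range(horizon)
    ]


errors = {"arima": 1.0, "ses": 3.0}  # illustrative validation errors
weights = inverse_error_weights(errors)
final = combine_forecasts({"arima": [10.0, 20.0], "ses": [20.0, 40.0]}, weights)
```

With these numbers the model with the lower error receives three times the weight of the other one.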
ARIMA models are, in theory, the most general class of models for forecasting a time series which can be made to be “stationary” by differencing (if necessary), perhaps in conjunction with nonlinear transformations such as logging or deflating (if necessary).
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”, while Loess is a method for estimating nonlinear relationships.
The simplest of the exponential smoothing methods is naturally called simple exponential smoothing (SES). This method is suitable for forecasting data with no clear trend or seasonal pattern.
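SES can be written as the recursion level_t = alpha * y_t + (1 - alpha) * level_(t-1), with the final level used as the forecast for every future step. The sketch below assumes the level is initialized with the first observation, which is one common choice among several; the function name is hypothetical.

```python
def simple_exponential_smoothing(series, alpha, horizon=1):
    """Forecast with SES; the level starts at the first observation (assumption)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    level = series[0]
    for value in series[1:]:
        # Blend the newest observation with the previous smoothed level.
        level = alpha * value + (1.0 - alpha) * level
    return [level] * horizon  # SES forecasts are flat over the horizon


ses_forecast = simple_exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5, horizon=2)
```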
A major disadvantage of a linear combination technique is that it considers only the contributions of the individual models, but totally overlooks the possible relationships among them. As a result, there is a considerable reduction in the forecasting accuracy of a linear combination scheme, when two or more participating models in the ensemble are correlated. To overcome this limitation, our ensemble technique is developed as an extension of the usual linear combination to deal with the possible correlations between pairs of forecasts.
A parameter tuning module is also included as part of the ensemble method. Parameter tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A grid search routine was built for the non-sklearn models.
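A generic grid search for models outside scikit-learn can be sketched as an exhaustive loop over all parameter combinations. The function `grid_search` and the callback `fit_and_score` are hypothetical names; the callback is assumed to fit a model with the given parameters and return its validation error, where lower is better.

```python
from itertools import product


def grid_search(fit_and_score, param_grid):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(**params)  # validation error for this combination
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(alpha, beta):
    """Stand-in for fitting a model; the minimum is at alpha=0.3, beta=2."""
    return (alpha - 0.3) ** 2 + (beta - 2) ** 2


best_params, best_score = grid_search(
    toy_score, {"alpha": [0.1, 0.3, 0.5], "beta": [1, 2, 3]}
)
```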
Multivariate Time Series Forecasting
Multivariate Time Series Forecasting is a forecasting module that also takes external features into consideration when forecasting the data.
Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems.
The simplest approach is to predict the value at the next time (t+1) given the value at the previous time (t-1). Shifting the dataset by 1 creates the t-1 column, with a NaN (unknown) value in the first row. The unshifted time series represents the t+1 column.
The addition of lag features is called the sliding window method, in this case with a window width of 1.
The window width can be expanded to include more lagged features. The first few rows that do not have enough data need to be discarded. Lag features are created for both the dependent variable and the external variables.
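The sliding window method can be sketched with pandas, assuming the data is held in a DataFrame. The helper `add_lag_features` and the column naming scheme are hypothetical.

```python
import pandas as pd


def add_lag_features(df, column, lags):
    """Add one shifted copy of `column` per lag and drop incomplete rows."""
    out = df.copy()
    for lag in lags:
        # shift(lag) moves values down by `lag` rows, leaving NaN at the top.
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    # The first max(lags) rows lack enough history and are discarded.
    return out.dropna().reset_index(drop=True)


frame = pd.DataFrame({"y": [10.0, 20.0, 30.0, 40.0, 50.0]})
lagged = add_lag_features(frame, "y", [1, 2])
```

With a window width of 2, the first two rows are dropped and each remaining row holds the current value together with its two predecessors. The same helper can be applied to external variables by passing their column names.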
Rolling window statistics are used to calculate summary statistics across the values in the sliding window and include these as features in the dataset. Perhaps the most useful is the mean of the previous few values, also called the rolling mean. First, the series must be shifted. Then the rolling dataset can be created, and the mean values calculated on each window of two values. Summary statistics, specifically the minimum, mean, and maximum value in the window can be calculated.
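The shift-then-roll procedure described above can be sketched with pandas as follows, using a window of two values. The column names are hypothetical.

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="y")

# Shift first so that each window only contains values before the current step.
shifted = series.shift(1)
window = shifted.rolling(window=2)

rolling_features = pd.DataFrame(
    {
        "roll_min": window.min(),
        "roll_mean": window.mean(),
        "roll_max": window.max(),
        "y": series,
    }
).dropna()  # rows without a full window are discarded
```

Shifting before rolling prevents the current value from leaking into its own features.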
Using the features generated above, a deep neural network, random forest or XGBoost model is fit.
Feature Attribution
It is important in many applications to understand what features are important for a model, and why individual predictions were made. For tree ensemble methods these questions are usually answered by attributing importance values to input features, either globally or for a single prediction. However, current feature attribution methods have been shown to be inconsistent, which means that changing the model to rely more on a given feature can decrease the importance assigned to that feature. To address this problem, SHAP (SHapley Additive exPlanations) values are used; they were recently shown to be the unique additive feature attribution method based on conditional expectations that is both consistent and locally accurate.
Given a set of time series, exploration of the predicted values becomes essential. The performance evaluation module helps to evaluate the predictions from a forecasting model with respect to the actual values.
Routines are written to compute error metrics such as MAPE and UMAPE, either at the combination level or for the entire set of available combinations. Figure 3.11 gives the mathematical description of every metric computed.
The performance metrics are calculated for individual time series. Based on the metrics, the top K good fits and bad fits are plotted to analyse which group has performed well and which has not, so that effort can be concentrated on the time series with bad fits. An example of a bad fit is shown in Figure 3.12.
The distribution of the error is also plotted to visualize the overall performance of the model. Figure 3.13 is an example of the error distribution plot.
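As an illustration of such a routine, the sketch below computes the mean absolute percentage error in its standard form, 100/n times the sum of |actual - forecast| / |actual|. Skipping zero actual values is an assumption made here to avoid division by zero; the definitions in Figure 3.11 may handle that case differently, and the other metrics are not reproduced.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Pairs whose actual value is zero are skipped (an assumption of this sketch).
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to evaluate")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


# Error per combination, then across all combinations at once.
per_series = {
    "series_a": mape([100.0, 200.0], [110.0, 180.0]),
    "series_b": mape([50.0, 50.0], [50.0, 25.0]),
}
overall = mape([100.0, 200.0, 50.0, 50.0], [110.0, 180.0, 50.0, 25.0])
```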
3.3 DESIGN PRINCIPLES
The design principles are a set of rules followed to organize or arrange the structural elements of a module.
The way in which these principles are applied affects the expressive content, or the message, of the work.
3.3.1 Structure of Notebooks
• Lower layer: All classes and functions are written in the underlying Python modules.
• Upper layer: The notebook accesses the lower layer.
Figure 3.14 shows a visualization of the two-layer approach.
Stand-alone script invocation is possible: the Python packages can be used directly, without going through the notebooks.
Importing the Python packages directly is shown in Figure 3.15.
Sphinx is a documentation generator written and used by the Python community. It is written in Python and is also used in other environments.
This means that it takes a set of plain-text source files and generates other output formats from them, mainly HTML. For this project, it can be thought of as a program that takes plain-text files in reStructuredText format and outputs HTML, so the main job of a Sphinx user is writing these text files. This requires at least minimal familiarity with reStructuredText as a language. It is similar to Markdown in many ways but considerably more powerful, and with that power comes increased complexity; some of the awkward syntax makes more sophisticated things possible later on. reStructuredText is extensible: it has a formal way of adding markup directives that allow more sophisticated parsing. For example, Sphinx includes directives to relate documentation of modules, classes and methods to the corresponding code. Figure 3.16 shows an example of generated Sphinx documentation.
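As a small illustration of how documentation relates to code, the sketch below shows a function whose docstring uses reStructuredText field lists (`:param:`, `:returns:`, `:rtype:`, `:raises:`). The function itself is a hypothetical example and is not taken from the project. With the `sphinx.ext.autodoc` extension enabled, a directive such as `.. autofunction::` can pull a docstring like this into the generated HTML.

```python
def mean_absolute_error(actual, forecast):
    """Return the mean absolute error between two equally long sequences.

    :param actual: Observed values.
    :type actual: list[float]
    :param forecast: Predicted values, aligned with ``actual``.
    :type forecast: list[float]
    :returns: The average of the absolute differences.
    :rtype: float
    :raises ValueError: If the sequences are empty or differ in length.
    """
    if not actual or len(actual) != len(forecast):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


mae = mean_absolute_error([1.0, 2.0], [2.0, 4.0])
```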