Fundamentals Of Statistical Modelling

Knowledge Base & Community Wiki

Driving In The Dark Without Headlights – Modelling system behaviour and system performance has traditionally been a black art, practised by experienced statisticians or researchers with doctorates. That has changed a fair bit over the years with the advent of commercial products; the reality today, however, is that modelling and forecasting still baffles the average IT professional, making its many benefits inaccessible to IT professionals in general. Many commercial vendors have attempted to change that over the years with complex and expensive simulation offerings, with very limited success. Look around your own IT organization, dig a bit deeper into how it manages systems through the design, build and support phases of the IT delivery life cycle, and you will see what we are talking about. Ask your system development or IT lead about the performance and scalability limits of the platform, i.e. what validation (performance testing, benchmarking, etc.) was done, and it is highly likely you will struggle to get a clear picture of the current situation. Rarer still are organizations with the capability to model and forecast system behaviour using complex commercial modelling and forecasting tools.

Most organizations and teams are in the habit of building complex systems and making large IT investments with little knowledge of the growth volumes those underlying investments can sustain. Build now, test later has been the traditional mantra, and that old habit is changing only slowly. Things might be slightly different with some of the new generation of front-end, customer-facing web applications, but dig a bit deeper and you will find all the traditional challenges with the back-office or core business applications that support the main business processes within any large organization.

We argue that in today's economy, where the only constant is change, your business needs the ability to look into the future and predict what impact a change in course will have on the system's ability to sustain ongoing business operations. Not having that insight is no longer an option. Besides, building the capability to validate system performance, and using that information to model and forecast system behaviour for growth in workload, is a business necessity, especially when it comes to core business applications. In this piece we will cover the technicalities associated with data collection from Linux/Unix systems for purposes of modelling system behaviour and eventually system performance.

Why Model System Behaviour & System Performance – Let's look at a few examples of why system modelling is beneficial and relate them to how we manage our investments in large and complex IT systems.

Let's say you worked for a cruise line and were responsible for investing in the construction of a large cruise vessel to cater to increased customer demand on one of your busiest international passenger routes. Knowing what ships cost to build these days, would you commit to a given model or design without having your designers test it incrementally, i.e. try out a computer simulation of the proposed model first, then a scaled-down real-world simulation (akin to a wind-tunnel test for aeroplanes), and only once the design is proven embark on building the real ship? Would you make large investments in a particular design or concept, no matter how efficient or ground-breaking it might be, without first estimating how the model would perform, first in a computer simulation and then potentially in a scaled-down real-world test?

Let's also consider an organization in charge of building a large and expensive bridge across a natural water system. Bridges are man-made wonders and have to battle the forces of nature throughout their lifetime. After geologists have mapped the terrain and architects have drawn up their plans, the next step is to put the design through a simulation on a computer, followed by a real-world wind-tunnel test simulating some of the toughest conditions Mother Nature will subject it to. Without such an approach you risk building a structure unable to stand up to the harsh environmental conditions it will face over the course of its lifetime. You wouldn't dream of jumping directly from the design stage to construction without simulating the behaviour of the bridge under various real-life conditions.

Investments in building your core IT systems follow the same lines. Before making a large investment in any IT application or system, wouldn't you want to ask your IT vendor, or the IT team responsible for building the system, whether the proposed design stacks up against the different Non-Functional Requirements (NFRs), and what modelling has been done beforehand to prove that the design is capable of meeting those NFRs?

One of the reasons businesses love SaaS-based offerings is that they can't seem to get from their internal IT teams the levels of efficiency and commitment, with regard to application performance, scalability, availability, functionality, etc., that they get from their SaaS vendors. This is another reason why many core business IT systems will continue to move away from residing within internal data centers (which are slowly becoming extinct dinosaurs) to being hosted by SaaS-based application providers.

What is Visualization, Modelling & Forecasting – Visualization, modelling and forecasting is the process of seeing your data, understanding your data, and designing models of the system that is responsible for generating that data. Once you have built a model of your system and validated it, you can use the model for what-if analysis to understand how changes in input conditions will change the way the system behaves, or even predict the breaking point of the system as incoming workload (users, transactions, messages, etc.) grows. Visualization has also taken on a fancier name, EDA or Exploratory Data Analysis, which focuses on slicing and dicing your data set to view it from different perspectives, see patterns in the data, and understand the real story behind it.

Visualization, modelling and forecasting is part of the capability you should be building within your IT team if you want your business to stay ahead of the game and keep serving customers as demand for your product or service grows. It isn't rocket science, and neither are the tools required to build the capability. It's the dedication, effort and investment in building the processes that most business and IT organizations shy away from, and the most common excuse we've heard is, "We don't have to worry about performance or capacity, we live in the cloud you know… just throw more virtual machines at the problem". It's sad but true: throwing more capacity at performance and scalability issues has mostly been viewed as the solution to the performance and scalability challenge, and with the advent of cloud computing that paradigm has taken an even stronger hold across business and IT organizations.

What sort of data is required for Visualization, Modelling & Forecasting – Hopefully by now you are convinced that Visualization, Modelling and Forecasting are essential for the sustainability and growth of your business in this fast paced world where the only constant is change. In this section let us have a look at the different types of modelling techniques followed by the data or metrics that one would require for purposes of building system models and performing forecasts using those models.

Analytical Models – Analytical or mathematical models generally consist of a set of mathematical equations used to model the behaviour of a given system. Common analytical or mathematical modelling approaches we have seen used are Little's Law (operational laws), Queuing Theory, etc. The main advantage of analytical models is that they require few data inputs; the challenge is the assumptions you have to make about system behaviour. Analytical models are generally prescribed for back-of-the-envelope calculations at the initial design stage.

  • Data required (sample) –
    • System configuration
      • Number of CPUs
      • Amount of Memory
      • Amount of Storage
    • Workload
      • Incoming transactions/unit time
      • Incoming messages/unit time
      • Incoming workflows/unit time
    • System characteristics
      • Service Time per component
      • Visit count per component
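As a minimal sketch of the analytical approach, the snippet below applies Little's Law (N = X × R) and the Utilization Law (U = X × S) to inputs of the kind listed above. All figures (throughput, response time, service demand, CPU count) are hypothetical and only illustrate the back-of-the-envelope style of calculation.

```python
# Back-of-the-envelope analytical model.
# Little's Law: N = X * R, where N is the average number of requests in
# the system, X the throughput (req/sec) and R the mean response time (sec).
# Utilization Law: U = X * S / servers, where S is the service demand.

def little_n(throughput, response_time):
    """Average number of concurrent requests in the system."""
    return throughput * response_time

def utilization(throughput, service_time, servers=1):
    """Per-server utilization for a given throughput and service demand."""
    return throughput * service_time / servers

X = 50.0   # hypothetical throughput: 50 transactions/sec
R = 0.4    # hypothetical mean response time: 400 ms
S = 0.06   # hypothetical CPU service demand per transaction: 60 ms
CPUS = 4   # hypothetical CPU count

print(little_n(X, R))           # 20.0 requests in flight on average
print(utilization(X, S, CPUS))  # 0.75, i.e. 75% per-CPU utilization
```

A per-CPU utilization of 0.75 would already be a warning sign for this hypothetical box: operational-law models like this take seconds to evaluate, which is exactly why they suit the initial design stage.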

Statistical Models – Statistical modelling for performance implies the use of industry-standard statistical techniques to perform what-if analysis and understand system behaviour for growth in business workload. Statistical models are generally more reliable than analytical models (if used sensibly) but rely on empirical data that is typically only available during performance testing (in the application build phase) or after the system goes live. Examples of statistical models are Time Series Regression, Time Series ARIMA Forecasting, etc.

  • Data required (sample) –
    • Workload Empirical Data (production time series collected over an extended duration)
      • CPU Utilization / Unit time
      • Memory Utilization / Unit time
      • Business Transactions / Unit time
      • Messages / Unit time
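Before fitting any statistical model to such series, it is worth checking how strongly the workload series tracks the resource series. The sketch below computes the Pearson correlation between hypothetical hourly transaction counts and CPU utilization samples; a coefficient close to 1 suggests a regression model is worth building.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical hourly production samples.
transactions = [120, 150, 90, 200, 170, 220, 130, 180]
cpu_util     = [32,  40,  25,  55,  46,  60,  35,  48]   # percent

r = pearson(transactions, cpu_util)
print(round(r, 3))   # close to 1.0: utilization tracks the workload
```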

Simulation Models – Simulation modelling for performance implies the use of industry-standard simulation techniques like Discrete Event Simulation, Markov chains, etc. to model system behaviour for a given set of input conditions. Simulation models can be tedious and complex to build (unless you have expensive commercial tools) but tend to provide more reliable predictions of system behaviour, especially when the system is still at the design or build phase.

  • Data required (sample) –
    • System configuration
      • Number of CPUs
      • Amount of Memory
      • Amount of Storage
    • Workload
      • Incoming transactions/unit time
      • Incoming messages/unit time
      • Incoming workflows/unit time
    • System characteristics
      • Service Time per component
      • Visit count per component
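To show that discrete event simulation need not require a commercial tool, here is a minimal single-server (M/M/1) queue simulation in plain Python. The arrival and service rates are hypothetical; the point is that a few dozen lines can estimate response time from the workload and service-time inputs listed above, and the estimate can be sanity-checked against the queueing-theory result R = 1/(μ − λ).

```python
import random

def simulate_mm1(arrival_rate, service_rate, n_jobs, seed=42):
    """Minimal discrete-event simulation of an M/M/1 queue (FCFS).
    Returns the mean response time (wait + service) over n_jobs jobs."""
    rng = random.Random(seed)
    clock = 0.0           # arrival time of the current job
    server_free_at = 0.0  # time at which the server next becomes idle
    total_response = 0.0
    for _ in range(n_jobs):
        clock += rng.expovariate(arrival_rate)   # next Poisson arrival
        start = max(clock, server_free_at)       # queue if server is busy
        server_free_at = start + rng.expovariate(service_rate)
        total_response += server_free_at - clock # departure - arrival
    return total_response / n_jobs

# Hypothetical system: 8 jobs/sec arriving, server completes 10 jobs/sec.
mean_r = simulate_mm1(arrival_rate=8.0, service_rate=10.0, n_jobs=50_000)
print(round(mean_r, 3))   # theory predicts 1/(10-8) = 0.5 sec on average
```

Raising `arrival_rate` towards `service_rate` in this sketch shows the characteristic hockey-stick growth in response time, which is precisely the breaking-point behaviour a simulation model is built to expose.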

Data requirements differ depending on the type of questions you need answered – To model the performance of a system you need the relevant data, and the data you require varies with the kind of performance modelling you intend to perform. Let's briefly look at the different types of questions one might want to answer and the associated data requirements for those performance models.

Question – How many orders can we process per hour before we run out of capacity on our boxes – In this case you are looking to answer an application-related question: how many orders can I currently process before I run out of capacity on my current boxes? As an architect or performance engineer this requires you to understand the relevant business processes (at a sufficiently high level) supported by the application, the data generated by those processes on the given system, and how to obtain that data, i.e. orders processed per unit of time.

Let's assume for purposes of this example that the Orders Placed business process, along with the supporting processes, i.e. shopping cart, add to trolley, view items, etc., is responsible for consuming compute resources on the system. You should then find some sort of relationship between the utilization of the system and the number of transactions processed on it, including a strong relationship between the utilization of the system and the number of orders processed per unit time. At minimum your data requirements would be –

  • Orders / Unit Time
  • Utilization / Unit Time
  • Transactions / Unit Time
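A minimal sketch of the regression idea, using hypothetical hourly samples: fit a least-squares line of utilization against orders, then solve the fitted line for the order rate at which utilization would hit a chosen ceiling (85% here, an assumed safety threshold).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical hourly samples: orders processed vs CPU utilization (%).
orders = [100, 200, 300, 400, 500]
util   = [18,  30,  42,  54,  66]

a, b = fit_line(orders, util)
# Solve a + b*x = 85 for x: the order rate at an 85% utilization ceiling.
ceiling_orders = (85 - a) / b
print(round(ceiling_orders))   # -> 658 orders/hour before the ceiling
```

In practice the fit would come from months of noisy production data rather than five clean points, and you would check the residuals and correlation before trusting the extrapolation.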

Advantages – Using Time Series Regression to build performance models is a sensible and reliable way to understand system performance under future growth in workload. These statistical models help identify the relationships between two system variables, e.g. Orders / Unit Time and Utilization / Unit Time, and use those relationships to understand system behaviour as workload grows.

Disadvantages – The main disadvantage of such a statistical modelling technique is the amount of data required to create the models. Collecting data can be a painful and very time-consuming task. Also keep in mind that any change to the system configuration (hardware, software, network, etc.) will render your model obsolete.

Question – How long before we run out of space on my storage subsystem – In this case you are looking to obtain a view of the growth in data stored on the disk subsystem, based on which you could plan to provision additional storage. There are a couple of ways you could look at this question. One is to approach it from an application standpoint, assess the data generated by the relevant business processes, and build a model of the data generated vs. the current storage consumption.

In reality such a model is very difficult to build, and most professional performance engineers or capacity planners will default to performing a Time Series Forecast to view the growth in data storage requirements. Time Series Forecasts have their own limitations (which we will not delve into at this stage); however, the data required at minimum for such a performance model would be –

  • Data Consumption / Unit Time
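As a sketch of the simplest possible time-series forecast, the snippet below fits a linear trend to hypothetical month-end storage consumption figures and projects when the trend line reaches a hypothetical 800 GB capacity. A real forecast would also handle seasonality and step changes, which a straight line cannot.

```python
def fit_trend(ys):
    """Least-squares linear trend y = a + b*t over equally spaced samples."""
    n = len(ys)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical month-end storage consumption in GB.
used_gb = [410, 432, 449, 471, 490, 512]
capacity_gb = 800

a, b = fit_trend(used_gb)          # b is the monthly growth rate in GB
# Months (from the first observation) until the trend hits capacity.
months_to_full = (capacity_gb - a) / b
print(round(months_to_full, 1))    # headroom left on the current trend
```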

Advantages – The advantage of Time Series Forecasting is that you can obtain a view of what system performance will look like based on past historical performance. These models are relatively easy to get up and running, assuming you have the relevant historical data at hand. The more granular the data and the bigger the historical data set, the stronger the forecast is likely to be.

Disadvantages – The main disadvantage of such a statistical modelling technique is the amount of historical data required to create the models. Collecting data can be a painful and very time-consuming task. You should also keep in mind that any change to the system configuration (hardware, software, network, etc.) will render your model obsolete. One of the biggest concerns that forecasters, performance engineers, architects, developers, etc. have with Time Series techniques is that they forecast the future purely from knowledge of the past, and that can sometimes be a very dangerous thing.

Conclusion – The intention of this article was to convey the importance of statistical modelling techniques for visualization, modelling and forecasting, so that you can perform what-if analysis and understand how system performance and behaviour would change with changes in input parameters (environment conditions). Statistical modelling techniques are very powerful, but they need to be combined with insight, gut feel and, most importantly, an understanding of the context within which the business operates.

Modelling Solution: VisualizeIT offers access to a range of Analytical Models, Statistical Models and Simulation Models for purposes of Visualization, Modelling & Forecasting. Access to all the Analytical (Mathematical) models is free. We recommend you try out the free Analytical models at VisualizeIT and drop us a note with your suggestions, input and comments. You can access the VisualizeIT website here and the VisualizeIT modelling solution here – VisualizeIT.



VisualizeIT Administrator & Community Moderator