Time Series Exploratory Data Analysis – Tutorial I
What Is Exploratory Data Analysis? Exploratory Data Analysis (EDA) is an approach to data analysis that employs a variety of techniques to:
- Maximize insight into a data set
- Uncover underlying structure within the data
- Extract important variables for purposes of analysis
- Detect outliers and anomalies within the data
- Test underlying assumptions
- Develop parsimonious models
- Determine optimal factor settings
EDA is precisely that: an approach, not a set of techniques. It is an attitude or philosophy about how a data analysis should be carried out. EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and each focusing on one aspect of data characterization. EDA covers more ground: it postpones the usual assumptions about what kind of model the data follow and instead lets the data itself reveal its underlying structure and model. It is a philosophy of how we dissect a data set, what we look for, how we look, and how we interpret. EDA makes heavy use of statistical graphics, but the two are not the same thing.
Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that the main role of EDA is to explore open-mindedly, and graphics give analysts unparalleled power to do so: they entice the data to reveal its structural secrets and leave the analyst always ready to gain some new, often unsuspected, insight. That power comes from combining graphics with the natural pattern-recognition capabilities we all possess.
The particular graphical techniques employed in EDA are often quite simple, and include:
- Plotting the raw data using line charts, histograms, bi-histograms, probability plots, lag plots, etc.
- Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.
The intention of using such plots is to make the most of our natural pattern-recognition abilities and to understand the story behind the data we actually see.
What does VisualizeIT offer in terms of Time Series EDA capability? VisualizeIT lets you slice, dice and view your data from different perspectives. Its Time Series EDA capability offers the following visualizations:
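As one illustration of how simple these techniques are, a lag plot (listed above) plots each observation against the one that came before it. The sketch below builds those points in plain Python. The function name `lag_pairs` and the sample readings are made up for illustration and are not part of VisualizeIT.

```python
def lag_pairs(series, lag=1):
    """Return the (y[t - lag], y[t]) points that a lag plot would draw."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    # Pair every value with the value observed `lag` steps earlier.
    return list(zip(series[:-lag], series[lag:]))


# Hypothetical readings, purely for illustration.
readings = [3, 5, 4, 6, 5, 7]
points = lag_pairs(readings)
print(points)
```

If the points scatter with no visible structure, the series behaves roughly like random noise at that lag; a clear line or cluster suggests that consecutive values are related.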
- Time Series Line Plot
- Histogram Plot
- Box Plot
- Summary Statistics
A prerequisite for using the Time Series EDA functionality is that you have uploaded data for the given Application / Data Dimension using the Data Management functionality within VisualizeIT. Once the data is uploaded, you can use Time Series EDA to explore it.
Let’s start with how to use the Time Series EDA functionality. Log in to VisualizeIT, select the Statistical Modelling link from the left-hand side menu and then select Time Series EDA. You are then presented with the Time Series EDA form.
The first part of the Time Series EDA form requires you to select an Application / Data Dimension, the Period (Start Time and End Time) for which you would like to view the data, and any Roll-up you would like to perform. A Roll-up makes sense when you have granular data, e.g. data collected at 5 minute intervals, and you would like to roll it up to hourly values when viewing data for an entire month. A Roll-up is in some cases mandatory when performing a forecast using the Time Series Forecasting capability in VisualizeIT.
Once you have selected the Application, Data Dimension, Start Time, End Time and Roll-up requirement and submitted the form, the resulting display will consist of the following elements for the chosen Application / Data Dimension:
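To make the idea of a Roll-up concrete, here is a minimal sketch that rolls 5 minute samples up to one value per hour. It assumes the roll-up takes the mean of each hour; VisualizeIT may aggregate differently (for example with a sum or a maximum), and the function name `roll_up_hourly` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def roll_up_hourly(samples):
    """Average (timestamp, value) samples into one value per clock hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate each timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    # Assumption: the mean is used as the aggregate for each hour.
    return [(hour, mean(values)) for hour, values in sorted(buckets.items())]


# Two hours of made-up 5 minute readings: 24 samples with values 0..23.
start = datetime(2024, 1, 1, 0, 0)
samples = [(start + timedelta(minutes=5 * i), float(i)) for i in range(24)]
for hour, value in roll_up_hourly(samples):
    print(hour.isoformat(), value)
```

The 24 input samples collapse into two hourly points, which is why a month of 5 minute data becomes far easier to read once rolled up.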
- Time Series Line Plot
- Histogram Plot
- Box Plot
- Summary Statistics
The Time Series Line Plot lets you visualize the behaviour of the Application / Data Dimension over a selected time period. If you do not select a Start Time and End Time, Time Series EDA presents all the data you have for the given Application / Data Dimension. That works for the line plot, but if you have large amounts of data covering an extended period you will find that the resulting Box Plots have too many bins, each holding a lot of data. One way around this is to select a shorter period of time, which produces neat line plots, histograms and box plots.
The Time Series EDA results also include a Histogram Plot, which shows the frequency of the data values for the given time period. It lets you visualize the spread of data for the Application / Data Dimension and judge whether it fits a Normal Distribution or some other distribution pattern. An understanding of frequency distributions is required when performing certain types of Statistical Modelling. You can also click the legend on the plot to enable or disable individual layers of the Histogram Plot.
The final plot on the Time Series EDA results page is the Box Plot. Each element of the Box Plot shows the following:
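The frequency view behind a histogram can be sketched in a few lines. The example below counts how many values fall into equal-width bins. The bin count, the function name `histogram_counts` and the sample values are assumptions for illustration; how VisualizeIT chooses its bins is not described here.

```python
def histogram_counts(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Made-up measurements, purely for illustration.
edges, counts = histogram_counts([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:>4.1f} to {right:>4.1f}: {'#' * count}")
```

A roughly symmetric, bell-shaped set of bars hints at a Normal Distribution, while a long tail on one side (as in this sample) points to a skewed distribution.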
- First Quartile
- Third Quartile
Note that if the number of outliers is too large to be displayed on the Box Plot, only a subset of those outliers is presented for the given duration. You can click the legend to enable or disable the display of outliers.
The final element on the Time Series EDA results page is the Summary Statistics table, which includes the following quantities:
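The following sketch shows one common way a box plot identifies outliers and how the displayed set might be capped. It assumes the conventional rule that points more than 1.5 times the IQR beyond the quartiles are outliers, and it assumes that when there are too many, the ones farthest from the median are kept. Neither rule is confirmed for VisualizeIT, and the name `box_plot_outliers` and its `max_shown` parameter are hypothetical.

```python
from statistics import quantiles


def box_plot_outliers(values, max_shown=None):
    """Return (q1, q3, outliers) using the conventional 1.5 * IQR fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    if max_shown is not None and len(outliers) > max_shown:
        # Assumption: keep only the points farthest from the median.
        outliers.sort(key=lambda v: abs(v - median), reverse=True)
        outliers = outliers[:max_shown]
    return q1, q3, sorted(outliers)


# Made-up response times with two unusually large readings.
data = [10, 12, 12, 13, 13, 14, 14, 15, 15, 16, 40, 55]
print(box_plot_outliers(data))
print(box_plot_outliers(data, max_shown=1))
```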
- IQR (Inter Quartile Range)
- Q1 (1st Quartile)
- Q3 (3rd Quartile)
- Standard Deviation
The Summary Statistics table should give you a good understanding of the spread of the data along with the main characteristics of the data set.
Conclusion: The Time Series EDA capability within VisualizeIT allows you to visualize the data for a given Application / Data Dimension as a Time Series Line Plot, while also showing the frequency characteristics of the data through a Histogram Plot and the spread of the data through a Box Plot. Use it to explore your data, understand the patterns in your data set and gain additional insight into the story behind the data for the given Application / Data Dimension.
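For readers who want to reproduce the quantities listed above outside the tool, here is a minimal sketch using Python's standard `statistics` module. It assumes the sample standard deviation and the "inclusive" quartile method; VisualizeIT may use different conventions, so small differences from its table are possible. The function name `summary_statistics` is hypothetical.

```python
from statistics import quantiles, stdev


def summary_statistics(values):
    """Compute Q1, Q3, IQR and standard deviation for a list of numbers."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Q1": q1,
        "Q3": q3,
        "IQR": q3 - q1,
        # Assumption: sample (n - 1) standard deviation.
        "Standard Deviation": stdev(values),
    }


# Made-up values, purely for illustration.
for name, value in summary_statistics([2, 4, 4, 4, 5, 5, 7, 9]).items():
    print(f"{name}: {value:.3f}")
```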
Modelling Solution: VisualizeIT offers access to a range of Analytical Models, Statistical Models and Simulation Models for Visualization, Modelling & Forecasting. Access to all the Analytical (Mathematical) models is free. We recommend you try them out and drop us a note with your suggestions, input and comments.