Table Overview of Exploratory Data Analysis

Phase | What to check | Red flags | What to record (reproducibility)
1) Dataset shape | # rows/cols, file sources, time range | Weirdly small/huge, mixed sources w/o labels | Data snapshot: counts, source paths, date pulled
2) Schema sanity | dtypes, parsing issues, category levels | IDs stored as floats, dates as strings | dtype conversions + why they were done
3) Missingness map | per-column %, per-row %, patterns by subgroup | missing not at random, “all-or-nothing” blocks | missingness stats + handling decision
4) Duplicates & keys | unique IDs, repeated records, 1:many joins | duplicates inflate performance | key policy + dedupe logic
5) Target integrity | label balance, definition consistency | label leaks into features, unclear time alignment | how target was defined + exclusions
6) Basic distributions | histograms/boxplots, range checks | impossible values (age < 0), heavy skew | notes on transforms (log, clipping)
7) Outliers | extreme values, rare categories | outliers are errors OR true rare events | outlier rule + keep/remove justification
8) Relationships | correlation, group comparisons, scatterplots | “too perfect” relationships | suspicious features list + candidate features
9) Leakage scan | features encoding future info | timestamps after prediction time | removed/blocked leakage features
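
To make the first four phases concrete, the sketch below shows what a minimal pass might look like in Python with pandas.  The file path and the `record_id` column are hypothetical placeholders.

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical source file

# Phase 1: dataset shape
print(df.shape)  # (n_rows, n_cols)

# Phase 2: schema sanity -- dtypes often reveal parsing problems
print(df.dtypes)  # e.g., an ID column parsed as float64 is a red flag

# Phase 3: missingness map -- per-column percentage of missing values
print((df.isna().mean() * 100).sort_values(ascending=False))

# Phase 4: duplicates & keys
print(df.duplicated().sum())  # count of fully duplicated rows
# If "record_id" (hypothetical) should be unique, verify it:
# assert df["record_id"].is_unique
```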

What Is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the initial process of examining a dataset to understand its structure, patterns, anomalies, and the relationships between variables [1].  Rather than starting with a fixed hypothesis or model, EDA follows an open-ended approach: the data itself suggests which questions are worth pursuing.  A useful analogy is receiving a sealed box filled with unknown components.  Before attempting to assemble anything, one must open the box, inspect each piece, and understand what can be made from the parts.

What EDA Produces (Practical Outputs)

EDA is not just about “looking around” in the dataset; it produces tangible artifacts that guide everything downstream.  By the end of EDA, you should be able to point to specific outputs such as:

  • A data dictionary: what each column represents, its units, valid ranges, and assumptions
  • A data quality report: missingness, duplicates, outliers, and parsing issues
  • A risk register: known issues such as leakage risk, confounds, and target ambiguity
  • A modeling readiness decision: whether the dataset is ready for modeling or needs revision

This “deliverable mindset” helps EDA stay systematic and reproducible rather than purely informal exploration.
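
One lightweight way to make the data quality report tangible is to compute it as a plain dictionary that can be saved alongside the analysis.  The following is a minimal sketch, not a standard API, assuming a pandas DataFrame:

```python
import json
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarize basic data-quality facts for a DataFrame."""
    return {
        "n_rows": int(df.shape[0]),
        "n_cols": int(df.shape[1]),
        "missing_pct_by_col": (df.isna().mean() * 100).round(2).to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "dtypes": {col: str(t) for col, t in df.dtypes.items()},
    }

# Usage on a toy frame:
df = pd.DataFrame({"age": [34, None, 29, 29], "city": ["NY", "LA", "LA", "LA"]})
print(json.dumps(quality_report(df), indent=2))
```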

Real-world datasets are frequently messy, incomplete, and potentially misleading.  EDA provides the opportunity to understand what the data truly contains before committing to modeling decisions.  In machine learning, the quality and structure of the data often have a greater impact on performance than the choice of algorithm [2].  This principle is encapsulated in the adage “Garbage In, Garbage Out.”  Applying models without understanding the dataset may lead to spurious correlations, poor performance, or data leakage.  EDA helps clarify the nature of the variables, identify limitations in the data, and determine whether meaningful learning of a target is possible [1].  EDA also reveals how the data should be preprocessed, transformed, and filtered.  Performing EDA early helps avoid costly modeling mistakes later in the machine learning pipeline.

EDA is not driven by a specific model or prediction task; it is intentionally exploratory.  This helps ensure that unexpected patterns are not overlooked and that assumptions imposed later by models are grounded in evidence rather than convenience.  However, EDA must also be approached with scrutiny, as early impressions may be misleading if not examined critically [3].

A useful real-world habit is to treat your EDA outputs like a lightweight “dataset documentation” record. One well-known proposal is to attach a structured datasheet to datasets that records motivation, composition, collection process, and recommended uses.

EDA vs Confirmatory Data Analysis (CDA)

EDA and CDA often work together, but they serve different purposes:

  • EDA focuses on hypothesis generation and sanity checking (finding patterns, issues, and candidate directions).
  • CDA focuses on hypothesis testing under scrutiny (formal evaluation, controlled analysis, and validation).

EDA asks: “What is this data really saying, and what problems does it contain?”
CDA asks: “Does this claim still hold after we test it carefully?”

A critical principle to emphasize when performing EDA is reproducibility.  While EDA is often informal and investigative, its purpose is not to produce one-time insights, but to systematically refine the dataset so that the analysis can be reliably revisited in the future.  The decisions made during exploratory data analysis directly shape downstream modeling.  It is therefore imperative that these decisions and changes are documented and reproducible.  Without proper records of exploratory steps, insights cannot serve as a dependable foundation for CDA.  Thorough documentation of the EDA process and the decisions made therein ensures that results can be replicated, validated, and extended, both by the analyst performing EDA and by others.
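
In practice, this can be as simple as appending each cleaning decision to a structured log as it is made.  The sketch below illustrates one possible convention (the function, file name, and example entry are all hypothetical):

```python
import json
from datetime import datetime, timezone

decisions = []

def record_decision(step: str, action: str, rationale: str) -> None:
    """Append one EDA decision to an audit log."""
    decisions.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "action": action,
        "rationale": rationale,
    })

record_decision(
    step="missingness",
    action="dropped rows where all lab values were missing",
    rationale="all-or-nothing block; affected ~1% of rows (illustrative figure)",
)

# Persist the log next to the analysis so it can be reviewed and re-run
with open("eda_decisions.json", "w") as f:
    json.dump(decisions, f, indent=2)
```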

Objectives Of EDA

A Simple Order of Operations for EDA

A practical EDA workflow tends to follow a repeatable sequence:

Shape → Types → Missingness → Duplicates → Target sanity → Leakage scan → Distributions → Relationships → Baseline model (optional)

This ordering matters because early issues (like broken data types, duplicated keys, or leakage) can distort everything you see later.
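
The target-sanity and leakage steps are the ones most often skipped, so here is an illustrative sketch of both.  The `target` and `event_time` columns and the cutoff date are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "target": [0, 0, 1, 0, 1],
    "event_time": pd.to_datetime(
        ["2024-01-02", "2024-01-05", "2024-02-01", "2024-03-10", "2024-03-15"]
    ),
})
prediction_time = pd.Timestamp("2024-02-15")  # hypothetical cutoff

# Target sanity: check the class balance
print(df["target"].value_counts(normalize=True))

# Leakage scan: rows whose timestamps fall after the prediction time
leaky = df[df["event_time"] > prediction_time]
print(f"{len(leaky)} rows contain post-cutoff information")
```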

One of the main objectives of EDA is to understand the basic properties of the data.  Statistical summaries such as the mean, median, variance, standard deviation, and range provide insight into central tendency and spread, indicating how the data is distributed [2].  However, EDA does not end at numerical summaries.  It is equally important to understand the size and structure of the dataset, the types of data it contains, and how the variables are distributed across observations.
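
In pandas, most of these summaries come from a handful of calls.  A minimal illustration on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0, sigma=1, size=1_000))

print(s.describe())       # count, mean, std, min, quartiles, max
print("median:", s.median())
print("variance:", s.var())
print("range:", s.max() - s.min())
print("skew:", s.skew())  # log-normal data shows strong right skew
```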

A second objective of EDA is identifying relationships among variables.  Correlations, dependencies, and interactions between variables help guide feature engineering and model design.  These relationships may also expose variables that leak information about the target; removing or correcting such features is essential for building models that generalize.  Additionally, EDA plays a critical role in detecting anomalies, outliers, and missing values, which may heavily influence model behavior if left unchecked.
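
One quick way to surface leakage candidates is to flag features whose correlation with the target is implausibly strong.  A sketch on synthetic data (the 0.95 threshold is a judgment call, not a rule):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
target = rng.normal(size=n)
df = pd.DataFrame({
    "target": target,
    "honest_feature": 0.3 * target + rng.normal(size=n),
    "leaky_feature": target + rng.normal(scale=0.01, size=n),  # near-copy
})

corr = df.corr(numeric_only=True)["target"].drop("target")
print(corr[corr.abs() > 0.95])  # flags "leaky_feature" for review
```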

EDA Techniques

EDA relies on both visualization and statistical summarization of the data.  Graphs such as scatter plots, histograms, box plots, and quantile-quantile plots help reveal patterns that may not be obvious from statistics alone.  While interpreting graphs involves some subjectivity, visualizations corroborated by statistical analysis and/or multiple complementary plots provide stronger evidence of patterns in the dataset [3].
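
Each of these plots takes only a few lines with matplotlib and scipy; an illustrative sketch on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 2 * x + rng.normal(scale=0.5, size=300)

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
axes[0].hist(x, bins=30)                      # distribution shape
axes[0].set_title("Histogram")
axes[1].boxplot(x)                            # spread and outliers
axes[1].set_title("Box plot")
axes[2].scatter(x, y, s=8)                    # bivariate relationship
axes[2].set_title("Scatter plot")
stats.probplot(x, dist="norm", plot=axes[3])  # quantiles vs. a normal
axes[3].set_title("Q-Q plot")
plt.tight_layout()
plt.show()
```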

Alongside visualization techniques, statistical analysis provides compact descriptions of the data.  A summary statistic provides a single value that characterizes some aspect of the dataset.  These values can provide a quick overview of the data structure or clustering.  Though quick and efficient, caution should be exercised when analyzing statistics, as they can easily conceal important details in the dataset, especially in the presence of extreme values or non-standard distributions [2].
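
A small demonstration of how summaries conceal shape: the two samples below share nearly identical means and standard deviations, yet one is unimodal and the other strongly bimodal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

unimodal = rng.normal(loc=0.0, scale=1.0, size=n)
bimodal = np.concatenate([rng.normal(-1, 0.2, n // 2), rng.normal(1, 0.2, n // 2)])
bimodal = (bimodal - bimodal.mean()) / bimodal.std()  # force mean 0, std 1

for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(f"{name}: mean={sample.mean():.3f}  std={sample.std():.3f}")
# Nearly identical summaries -- but histograms would tell different stories.
```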

Although EDA does not conform to a specific model, it may include some simple exploratory modeling.  For instance, fitting a basic linear regressor may yield insights about the structure of the data.  A near-perfect fit may indicate data leakage or overly deterministic relationships between variables.  On the opposite end, a poor fit might indicate non-linear structure, though it could equally be caused by missing variables or noisy measurements [1].  When using models in EDA, it is important to remember that the goal is not to validate pre-existing hypotheses about the dataset, but to inform avenues of further investigation.
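
A sketch of such a probe with scikit-learn, using synthetic data with a deliberately non-linear signal; in real use, `X` and `y` would come from the dataset under study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)  # non-linear signal

r2 = LinearRegression().fit(X, y).score(X, y)
print(f"R^2 = {r2:.3f}")
# R^2 near 1.0 -> suspect leakage or a deterministic relationship
# R^2 near 0.0 -> possibly non-linear structure, missing variables, or noise
```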

Common EDA Traps (That Lead to Bad Models)

EDA can be misleading if early impressions are not examined carefully. Common mistakes include:

  • Treating correlation as explanation: relationships may be driven by confounds, duplicated records, or sampling bias
  • Over-cleaning outliers: removing extreme values can delete the true signal (not just noise)
  • Imputing without checking patterns: missing data may reflect a real mechanism (not randomness)
  • Accidentally creating leakage: features may include information from the future relative to the prediction task
  • Believing a single visualization: plots should be cross-checked with statistics and subgroup comparisons

EDA should reduce uncertainty, not create false confidence.
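
For the imputation trap in particular, a quick subgroup comparison can reveal whether missingness tracks another variable.  A sketch with hypothetical `group` and `income` columns, where missingness is simulated to depend on the group:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 1_000
df = pd.DataFrame({"group": rng.choice(["A", "B"], size=n)})

# Simulate income that is missing far more often in group B (not at random)
income = rng.normal(50_000, 10_000, size=n)
missing = np.where(df["group"] == "B", rng.random(n) < 0.4, rng.random(n) < 0.05)
df["income"] = np.where(missing, np.nan, income)

# Missingness rate by subgroup -- a large gap suggests missing-not-at-random
print(df.groupby("group")["income"].apply(lambda s: s.isna().mean()))
```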

Subcategories Of EDA

EDA is organized according to how many variables are examined at once: univariate, bivariate, or multivariate (each of which is further split into graphical and non-graphical analysis).

Univariate analysis focuses on a single variable.  Analyzing one variable at a time helps characterize its distribution, cleanliness, and variability.  Tools such as summary statistics, histograms, and box plots help determine whether a variable is skewed, contains outliers, or has missing values [1].
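
A univariate pass over one column might look like the following sketch (synthetic, right-skewed data; the 1.5 × IQR rule is one common convention):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
s = pd.Series(rng.exponential(scale=2.0, size=1_000))

print("skew:", round(s.skew(), 2))  # well above 1: strong right skew
print("missing:", int(s.isna().sum()))

# Flag potential outliers with the 1.5 * IQR rule
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print("flagged outliers:", len(outliers))
```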

Bivariate analysis examines how pairs of variables relate.  Scatter plots are commonly used to visualize potential associations in numerical data, while correlation coefficients and covariance quantify how strongly variables change together.  When variables are categorical, cross-tabulations and contingency tables reveal how frequently combinations of values occur together [1].  These relationships help determine whether variables are related, and whether those relationships may be useful for prediction.
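
Both the numerical and the categorical cases fit in a few lines; a sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [160, 172, 168, 181, 175, 190],
    "weight_kg": [55, 70, 64, 82, 74, 95],
    "smoker":    ["no", "no", "yes", "no", "yes", "yes"],
    "region":    ["north", "south", "north", "south", "south", "north"],
})

# Numerical pair: Pearson correlation coefficient
print(df["height_cm"].corr(df["weight_kg"]))

# Categorical pair: contingency table of joint frequencies
print(pd.crosstab(df["smoker"], df["region"]))
```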

Multivariate analysis examines three or more variables to determine how features interact in higher-dimensional space.  Techniques such as pair plots provide an overview of multiple relationships at once.  Unsupervised learning methods such as Principal Component Analysis (PCA) may also be used to help identify the most informative components in the data.  Depending on the nature of the dataset, methods such as time series analysis or spatial analysis may also be appropriate [1].
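
A PCA sketch with scikit-learn on synthetic data; standardizing first matters because PCA is sensitive to feature scale:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Five observed features driven by only two latent directions
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 5)) + rng.normal(scale=0.1, size=(300, 5))

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_.round(3))
# The first two components should capture most of the variance here.
```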

Further Considerations

Handling missing data is an important concern in data analysis.  Deciding whether to remove missing values or impute them (replace them with plausible values) may significantly impact downstream model results.  Removing data may discard meaningful information and relationships.  Conversely, improper imputation may introduce uncertainty and bias.  The effects of both removing and imputing missing data should be considered during the interpretation of results.
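
The sketch below contrasts dropping and median-imputing a toy column; in practice, the comparison would also be run against downstream metrics:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
s = pd.Series(rng.normal(100, 15, size=1_000))
s[rng.random(1_000) < 0.2] = np.nan  # make ~20% of values missing (at random)

dropped = s.dropna()
imputed = s.fillna(s.median())       # one simple imputation choice

print("mean of observed values:", round(dropped.mean(), 2))
print("mean after imputation:  ", round(imputed.mean(), 2))
print("std after imputation:   ", round(imputed.std(), 2))
# The std shrinks: constant-value imputation understates the true spread,
# one concrete way imputation can introduce bias.
```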

In the case of exceedingly large datasets, EDA may involve determining which features are most informative and which features may be discarded.  Reducing the dimensionality of the feature space may help simplify models and improve performance, though careless feature removal may eliminate information or introduce bias [3].  For this reason, making decisions on dropping features should be guided by exploratory findings rather than an arbitrary set of rules.  Additionally, understanding the size and complexity of the dataset helps determine which machine learning methods are feasible.
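
Two simple exploratory heuristics along these lines are dropping near-constant features and flagging near-duplicate ones; the thresholds below are illustrative choices, not rules:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
df = pd.DataFrame({
    "useful": rng.normal(size=200),
    "constant": np.ones(200),           # zero variance -> no signal
})
df["near_copy"] = 1.001 * df["useful"]  # almost identical to "useful"

# Heuristic 1: near-zero variance
low_var = [c for c in df.columns if df[c].var() < 1e-8]

# Heuristic 2: pairwise correlation above a threshold (upper triangle only)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.99).any()]

print("candidates to drop:", sorted(set(low_var + redundant)))
```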

Conclusion

Exploratory Data Analysis is a foundational step in machine learning.  By carefully examining the data before implementing machine learning models, EDA reveals structure in the dataset, uncovers potential complications, and helps ensure subsequent analysis is grounded in reality rather than assumption.  Effective EDA is not only about discovering patterns in the dataset, but about doing so in a way that is reproducible.  Although EDA is inherently exploratory, the transformations, filtering decisions, and assumptions made during this phase shape downstream results and must therefore be thoroughly documented.  Reproducibility allows analysts to revisit, validate, and extend prior work, making exploratory insights a reliable foundation for confirmatory analysis and modeling.

EDA is driven by careful questioning.  It is important to consider the problem the data is intended to address, what each variable represents, and the limitations or quality issues that may exist.  Exploratory Data Analysis contrasts with Confirmatory Data Analysis: in confirmatory analysis, hypotheses identified during EDA are formally tested to see whether they hold under scrutiny.  John Tukey, who pioneered EDA, stated:

“Unless the detective finds clues, judge or jury has nothing to consider.  Unless exploratory data analysis uncovers indicators, there is likely nothing for confirmatory data analysis to consider [4].”

By the End of EDA, You Should Have:

  • A data dictionary (meaning, units, valid ranges)
  • A quality report (missingness, duplicates, outliers, parsing issues)
  • A leakage/confound list (features to block or constrain)
  • A preprocessing plan (encoding, scaling, imputation strategy)
  • A baseline benchmark (even if intentionally simple)
  • A reproducible notebook/script (same results on re-run)

References

[1] F. Hartwig and B. E. Dearing, Exploratory Data Analysis. Thousand Oaks, CA, USA: SAGE Publications, 2021. [Online]. Available: https://methods.sagepub.com/book/mono/exploratory-data-analysis/toc

[2] T. G. Avval et al., “The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE),” J. Chem. Inf. Model., vol. 61, 2021, doi: 10.1021/acs.jcim.1c00244. [Online]. Available: https://pubs.acs.org/doi/10.1021/acs.jcim.1c00244

[3] S. Ferketich and J. Verran, “Technical Notes,” Research in Nursing & Health, vol. 9, no. 4, p. 409, 1986, doi: 10.1177/019394598600800409. [Online]. Available: https://journals.sagepub.com/doi/10.1177/019394598600800409

[4] J. W. Tukey, Exploratory Data Analysis. Menlo Park, CA, USA: Addison-Wesley, 1977.