Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using visual methods.
It allows the user to:
- understand the data’s structure,
- understand how to build a good model,
- detect patterns,
- decide which statistical techniques to use by looking at distributions and correlations,
- decide which variables to include,
- decide how to handle data-quality issues,
- spot anomalies and outliers,
- generate and test hypotheses with statistical methods,
- and check assumptions through statistical summaries and graphical representations.
Common Techniques Used in EDA
- Descriptive Statistics: Calculating measures such as mean, median, mode, standard deviation, and percentiles to summarize data.
- Data Visualization: Using plots and charts like histograms, box plots, scatter plots, and bar charts to visually explore data.
- Correlation Analysis: Assessing relationships between variables using correlation coefficients and scatter plots.
- Data Transformation: Applying transformations to data, such as normalization or log transformation, to better understand its characteristics.
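The non-graphical techniques above can be sketched in a few lines of pandas. This is a minimal illustration, not a prescribed workflow: the toy DataFrame and its column names (income, age) are made up for the example, and plotting is left out to keep it dependency-light.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 80_000, 400_000],
    "age": [22, 30, 41, 55, 63],
})

# Descriptive statistics: count, mean, std, min, quartiles, and max per column.
summary = df.describe()

# The median is also available directly (it equals the "50%" row of describe()).
median_income = df["income"].median()

# Correlation analysis: pairwise Pearson correlation between the numeric columns.
corr = df.corr()

# Data transformation: log1p compresses the long right tail of income.
df["log_income"] = np.log1p(df["income"])

print(summary)
print(corr)
```

Comparing df["income"].skew() with df["log_income"].skew() is one quick way to check whether the transformation reduced the skew.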
Tools and Libraries for EDA
- Python Libraries: Pandas, Matplotlib, Seaborn, and Plotly are commonly used for EDA in Python.
- Data Visualization Tools: Tools like Tableau and Power BI can also be used for interactive exploratory analysis.
Code Snippets
What are the data types for various features?
df.info()
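As a small illustration, df.info() prints its report and returns None, while df.dtypes gives the same type information as a Series that can be inspected in code. The DataFrame below is a made-up example.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "age": [31, 45, 27],
    "score": [0.5, None, 0.9],
})

# Prints column names, non-null counts, dtypes, and memory usage to stdout.
df.info()

# The dtypes alone, as a Series indexed by column name.
dtypes = df.dtypes
print(dtypes)
```

Text columns usually show up as object (or a dedicated string dtype, depending on the pandas version), so the exact label for name may vary.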
Get a summary of missing data in a DataFrame.
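One common way to do this, sketched here as an assumption rather than as the note's own snippet, is to count missing values per column with isna().sum() and express them as a percentage with isna().mean(). The data and the column names n_missing and pct_missing are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with some missing entries.
df = pd.DataFrame({
    "age": [31, np.nan, 27, 52],
    "city": ["Oslo", None, None, "Lima"],
    "score": [0.5, 0.7, 0.9, 0.4],
})

# One row per column: number and percentage of missing values.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
})
print(missing)
```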
What does building a DataFrame from x.columns.values with columns=['Features'] do?
x.columns.values: Assuming x is a DataFrame, x.columns retrieves the column names of the DataFrame, and .values converts them into a NumPy array.
columns=['Features']: This sets the column name of the new DataFrame. The resulting DataFrame has one column named 'Features' containing the names of the features.
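The two pieces described above most likely belong to a pd.DataFrame(...) call. The sketch below shows that reading, which is an assumption because the original expression is not preserved; x here is a made-up DataFrame.

```python
import pandas as pd

# Hypothetical input DataFrame; only its column names matter here.
x = pd.DataFrame({"age": [31, 45], "income": [40_000, 52_000]})

# Array of the column names.
names = x.columns.values

# New one-column DataFrame listing the feature names.
features = pd.DataFrame(x.columns.values, columns=["Features"])
print(features)
```

A table like this is often used as a starting point for attaching per-feature information, such as importances or missing-value counts.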