Prompt:

  1. Write up the strategy you use for EDA (Exploratory Data Analysis).
  2. What is your overall goal when doing an EDA?
  3. What methods do you think are important?
  4. What things do you try to look for?

Response:

The purpose of an EDA is to perform visualizations and identify significant patterns, while also proposing hypotheses for why these patterns occur. (1) First, we should try to understand our data and collect as much context and information about the data as possible. Noting which variable serves as a unique identifier for each observation is important in this step. (2) Next, we should check for missing data. It is important to check all variables for missing data and possibly rank the variables from the most missing data to the least. For each variable, it is important to try to understand why data is missing (something which doesn’t always have a clear answer) and what it can mean. (3) Next, we need to provide basic descriptions and features of our sample. Features will be categorized as either continuous, discrete, or categorical. Categorizing these features helps us determine which visualizations to choose for our EDA. (4) Next, we identify the shape and distribution of our data. It is important to calculate the mean and variance of each numeric feature, and to try to hypothesize about any behavior we see. (5) Another important step is to identify any significant correlations between variables. (6) Looking for and spotting any outliers in our data set is also very important, because outliers can distort summary statistics such as the mean and variance and mislead later statistical analysis.
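To make steps (2), (4), (5), and (6) a bit more concrete, here is a minimal sketch in Python. The data set, the column names, and the 1.5 × IQR outlier rule are all assumptions made for illustration; they are not part of the prompt. I use only the standard library so the example runs anywhere, although in practice a library such as pandas would usually be more convenient.

```python
from math import sqrt
from statistics import mean, quantiles, variance

# Hypothetical toy data set; None marks a missing value.
rows = [
    {"id": 1, "age": 23, "income": 31000, "city": "Raleigh"},
    {"id": 2, "age": 35, "income": 52000, "city": None},
    {"id": 3, "age": None, "income": 48000, "city": "Durham"},
    {"id": 4, "age": 41, "income": None, "city": "Raleigh"},
    {"id": 5, "age": 29, "income": 39000, "city": None},
    {"id": 6, "age": 52, "income": 250000, "city": "Cary"},
]
columns = ["age", "income", "city"]


def observed(col):
    """Return the non-missing values of one column."""
    return [r[col] for r in rows if r[col] is not None]


# Step 2: count missing values per variable, ranked from most to least.
missing = {c: sum(r[c] is None for r in rows) for c in columns}
missing_ranked = sorted(missing.items(), key=lambda kv: kv[1], reverse=True)

# Step 4: mean and (sample) variance of each numeric feature.
summary = {
    c: {"mean": mean(observed(c)), "variance": variance(observed(c))}
    for c in ["age", "income"]
}

# Step 5: Pearson correlation, using only rows where both values are present.
pairs = [(r["age"], r["income"]) for r in rows
         if r["age"] is not None and r["income"] is not None]
xs, ys = zip(*pairs)
mx, my = mean(xs), mean(ys)
cov = sum((x - mx) * (y - my) for x, y in pairs)
r = cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Step 6: flag outliers with the 1.5 * IQR rule (one common convention).
q1, _, q3 = quantiles(observed("income"), n=4, method="inclusive")
iqr = q3 - q1
outliers = [v for v in observed("income")
            if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(missing_ranked)
print(summary)
print(round(r, 3))
print(outliers)
```

Even on this tiny example, one very large income drives up both the mean and the variance of that column, which is exactly why step (6) matters before trusting the numbers from step (4).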

The overall goal when doing an EDA is to analyze and investigate data sets and summarize their main characteristics. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions (IBM article). Basically, it gives you a nice and informative overview of your data so that you can best decide what to do with it, how to approach it, and how to find the answers you might be searching for.

I feel that univariate and multivariate graphical methods are most important, because seeing graphs is the easiest way to understand your data and how certain variables may be correlated (or not correlated). This is essential for forming hypotheses and finding patterns. You might look at statistical summaries of variables as well, such as the mean, median, and variance, or contingency tables that show the frequencies of the categorical variables. It is also important to look at quantitative variables at each level of some of the categorical variables to find patterns. Another thing that I think is really important is checking for missing data as well as outliers, since these can alter your tables and graphs significantly. By accounting for these things, your EDA will be more accurate and informative.
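As a rough illustration of the non-graphical methods mentioned above, the sketch below builds a one-way frequency table, a two-way contingency table, and a summary of a quantitative variable at each level of a categorical variable. The records and the names `species`, `habitat`, and `weight` are hypothetical, and this is just one simple way to do it with the Python standard library.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# Hypothetical records: two categorical variables and one quantitative one.
records = [
    {"species": "A", "habitat": "forest", "weight": 4.0},
    {"species": "A", "habitat": "forest", "weight": 5.0},
    {"species": "A", "habitat": "field", "weight": 6.0},
    {"species": "B", "habitat": "field", "weight": 9.0},
    {"species": "B", "habitat": "field", "weight": 11.0},
    {"species": "B", "habitat": "forest", "weight": 10.0},
]

# One-way frequency table for a single categorical variable.
freq = Counter(rec["species"] for rec in records)

# Two-way contingency table: counts for each (species, habitat) combination.
table = Counter((rec["species"], rec["habitat"]) for rec in records)

# Quantitative variable summarized at each level of a categorical variable.
groups = defaultdict(list)
for rec in records:
    groups[rec["species"]].append(rec["weight"])
by_species = {
    level: {"mean": mean(values), "median": median(values)}
    for level, values in groups.items()
}

print(freq)
print(table)
print(by_species)
```

Comparing the grouped means is the numerical counterpart of drawing side-by-side boxplots: if the groups differ a lot, that is a pattern worth forming a hypothesis about.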

