The data science lifecycle is based on the fundamental scientific process:

  1. What is the question at hand?
  2. What data is actually available?
  3. Data Preprocessing (e.g., Feature Engineering)
  4. Data Modeling and Performance Evaluation
  5. Review Modeling Outputs
  6. What are the potential answer(s)?

Access to the data is a key component of any data science project. Once you have access, the work splits into the subcomponents of data cleaning and exploration.

Once you start assembling data sets, exploring, refining, and cleaning them is paramount to achieving a reproducible outcome you can stand behind. Like a building that needs solid foundations, you need to understand what you're feeding into the pipeline.
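As a concrete illustration, here is a minimal exploration-and-cleaning sketch using pandas. The file name `measurements.csv` and the `temperature` column are hypothetical stand-ins for your own data.

```python
import pandas as pd

# Load a raw data set; file name and columns are hypothetical stand-ins.
df = pd.read_csv("measurements.csv")

# Exploration: understand what you are feeding into the pipeline.
print(df.shape)         # rows and columns
print(df.dtypes)        # column types
print(df.describe())    # summary statistics for numeric fields
print(df.isna().sum())  # missing values per column

# Cleaning: remove exact duplicates and handle missing values explicitly.
df = df.drop_duplicates()
df = df.dropna(subset=["temperature"])  # keep only rows with a measurement
```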

Feature engineering is the addition of new fields (i.e., facts) to the data set, derived from fields already present. Ideas for these fields often come from subject matter experts (e.g., scientists, engineers) who can provide insight into the processes and phenomena that may or may not be captured in the data.
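For example, here is a sketch of two derived fields in pandas: a physical ratio a subject matter expert might suggest, and a temporal feature extracted from a date. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical readings; in practice these come from your cleaned data set.
df = pd.DataFrame({
    "mass_kg": [2.0, 3.5, 1.2],
    "volume_m3": [0.001, 0.002, 0.0008],
    "timestamp": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-09"]),
})

# New fields derived from fields already present.
df["density_kg_m3"] = df["mass_kg"] / df["volume_m3"]  # ratio an SME might suggest
df["day_of_week"] = df["timestamp"].dt.dayofweek       # temporal feature from a date
print(df)
```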

Predictive modeling is the discipline of applying mathematical and statistical techniques to find patterns in historical data, from which predictions about future cases can be made.
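A minimal predictive-modeling sketch with scikit-learn, using synthetic data in place of real historical records; the random forest and its settings are illustrative choices, not a prescribed technique.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical features (X) and an outcome (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Hold out data the model never saw, so the performance estimate is honest.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)    # find patterns in the "historical" data
preds = model.predict(X_test)  # predict on unseen cases
print(f"MAE: {mean_absolute_error(y_test, preds):.3f}")
```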

Fundamentally, you are doing this work to answer the question at hand: What is the modeling technique telling you? Can you verify the outcomes with subject matter experts in the domain?
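One way to support that review is to express the model's behavior in domain terms, for example through feature importances an expert can sanity-check. This sketch refits the synthetic model from the section above; the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Refit the synthetic model from the sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2]
model = RandomForestRegressor(random_state=0).fit(X, y)

# Feature importances phrased in (hypothetical) domain terms for SME review.
for name, score in zip(["pressure", "temperature", "flow_rate"], model.feature_importances_):
    print(f"{name}: {score:.2f}")
```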