Not OP. This question is being reposted to preserve technical content removed from elsewhere. Feel free to add your own answers/discussion.

Original question:

I’ve been given a dataset with several variables and a binary success metric (1 or 0) at the end. I’ve been asked to analyze it and give insights on how to improve the success rate. To do this I intend to do a thorough data analysis to study correlations and relationships. I also intend to run a logistic regression to confirm these correlations using the feature coefficients.

My question is: if my sole interest is understanding the most important feature determining a metric, and not building a robust model, should I still split my dataset in two? What benefits do I have splitting it? Won’t my exploratory analysis lose value if I set aside, let’s say, 20%?

Thank you for your help

  • ShadowAetherOPM
    2 years ago

    Original answer:

    What benefits do I have splitting it?

    There is certainly an argument for keeping part of the data out of the analysis (effectively hidden from you while you explore): you can then use that held-out part to validate any conclusions you draw from the exploratory analysis on the rest of the data.
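
To make the holdout idea concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the synthetic data (where only feature 0 actually drives success), the helper names `pearson`, `make_synthetic_rows` and `correlations`, and the 80/20 split. It also swaps in a plain correlation for the logistic regression mentioned in the question, just to stay dependency-free. The point is only the workflow: pick the "most important" feature on the exploration part, then check whether the relationship still shows up in the part you never looked at.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def make_synthetic_rows(n_rows, n_features, rng):
    """Hypothetical data: only feature 0 influences success; the rest are noise."""
    rows = []
    for _ in range(n_rows):
        features = [rng.gauss(0, 1) for _ in range(n_features)]
        p_success = 0.8 if features[0] > 0 else 0.2
        success = 1 if rng.random() < p_success else 0
        rows.append((features, success))
    return rows


def correlations(rows):
    """Correlation of each feature column with the 0/1 success column."""
    ys = [s for _, s in rows]
    n_features = len(rows[0][0])
    return [pearson([f[j] for f, _ in rows], ys) for j in range(n_features)]


rng = random.Random(0)
rows = make_synthetic_rows(n_rows=2000, n_features=10, rng=rng)
rng.shuffle(rows)

# Set aside 20% before looking at anything; explore only the other 80%.
cut = int(0.8 * len(rows))
explore, holdout = rows[:cut], rows[cut:]

# "Exploratory analysis": pick the feature most correlated with success.
explore_corr = correlations(explore)
top = max(range(len(explore_corr)), key=lambda j: abs(explore_corr[j]))

# Validation: does the same feature show a similar relationship on unseen data?
holdout_corr = correlations(holdout)
print(f"top feature on exploration set: {top} (r = {explore_corr[top]:.2f})")
print(f"same feature on holdout set:    r = {holdout_corr[top]:.2f}")
```

If the correlation on the holdout set is close to the one found during exploration, the finding is less likely to be an artifact of having searched through many features. If it collapses toward zero, the exploration probably picked up noise.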