50 Years of Data Science

ABSTRACT
More than 50 years ago, John Tukey called for a reformation of academic statistics. In “The Future of Data Analysis,” he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or “data analysis.” Ten to 20 years ago, John Chambers, Jeff Wu, Bill Cleveland, and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland and Wu even suggested the catchy name “data science” for this envisioned field. A recent and growing phenomenon has been the emergence of “data science” programs at major universities, including UC Berkeley, NYU, MIT, and most prominently, the University of Michigan, which in September 2015 announced a $100M “Data Science Initiative” that aims to hire 35 new faculty. Teaching in these new programs has significant overlap in curricular subject matter with traditional statistics courses; yet many academic statisticians perceive the new programs as “cultural appropriation.” This article reviews some ingredients of the current “data science moment,” including recent commentary about data science in the popular media, and about how/whether data science is really different from statistics. The now-contemplated field of data science amounts to a superset of the fields of statistics and machine learning, which adds some technology for “scaling up” to “big data.” This chosen superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss out on the really important intellectual event of the next 50 years.
Because all of science itself will soon become data that can be mined, the imminent revolution in data science is not about mere “scaling up,” but instead the emergence of scientific studies of data analysis science-wide. In the future, we will be able to predict how a proposal to change data analysis workflows would impact the validity of data analysis across all of science, even predicting the impacts field-by-field. Drawing on work by Tukey, Cleveland, Chambers, and Breiman, I present a vision of data science based on the activities of people who are “learning from data,” and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s data science initiatives, while being able to accommodate the same short-term goals.

David Donoho
Department of Statistics, Stanford University, Stanford, CA
