# Valid Inference with Double Dipping - Online Course

Distinguished Speaker Series: A Seminar Taught by Daniela Witten

Course Dates: Monday, December 16, 2024

Schedule: All sessions are held live via Zoom. All times are ET (New York time).

12:00pm-3:00pm

ABSTRACT

Textbooks on statistical inference typically assume that the data analyst has chosen a hypothesis to test or a confidence interval to estimate before looking at the data—or, better yet, before they have even collected it! However, in reality, statistical practice often proceeds quite differently: an analyst may first explore the data in order to come up with a statistical question that seems “interesting” and then use the same data to answer that question.

This practice is often described as “double dipping.” Unfortunately, classical statistical machinery does not apply when we have double dipped: for instance, hypothesis tests will reject a true null hypothesis far more often than their nominal level allows, and confidence intervals will cover the parameter of interest less often than their stated coverage. This leads to spurious findings that will not hold up in future studies. In this course, we will talk about recent developments that enable valid inference with double dipping.
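The inflation described above can be seen in a small simulation. The sketch below is illustrative only and is not taken from the course: the setup (20 features whose true means are all zero, known unit variance, and an analyst who tests whichever feature looks most extreme) and every function name are assumptions chosen for the example.

```python
import math
import random


def double_dip_rejection_rate(n_obs=25, n_features=20, n_reps=1000, seed=0):
    """Fraction of simulated studies in which a naive 5% z-test rejects
    a true null after the tested feature was chosen from the same data."""
    rng = random.Random(seed)
    z_crit = 1.959963984540054  # two-sided 5% critical value of N(0, 1)
    rejections = 0
    for _ in range(n_reps):
        # One z-statistic per feature: sqrt(n) * sample mean (variance is 1).
        z_stats = []
        for _ in range(n_features):
            sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
            z_stats.append(math.sqrt(n_obs) * sum(sample) / n_obs)
        # Explore: pick the most "interesting" feature using all of the data.
        selected = max(z_stats, key=abs)
        # Double dip: test that feature on the same data, ignoring selection.
        if abs(selected) > z_crit:
            rejections += 1
    return rejections / n_reps


rate = double_dip_rejection_rate()
print(f"Naive rejection rate under the null: {rate:.2f}")
```

Under these assumptions, the chance that at least one of 20 independent null statistics exceeds the 5% cutoff is 1 - 0.95^20, roughly 0.64, so the simulated rate should land far above the nominal 0.05.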

During the first hour, we will consider double dipping through the lens of multiple testing. We will show that in very simple settings, multiple testing corrections (many of which have been around for decades) may be suitable solutions to the double dipping problem. However, when the settings get more complicated (and more realistic), multiple testing corrections don’t cut it.
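As one example of a long-standing correction, the Bonferroni procedure compares each p-value to the significance level divided by the number of hypotheses considered. The following is a minimal sketch with hypothetical function names and made-up test statistics; the course may focus on other corrections.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i only if p_i <= alpha / m, with m the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Made-up z-statistics for three hypotheses that were all examined.
z_stats = [3.5, 2.1, 0.4]
p_values = [two_sided_p_value(z) for z in z_stats]
print(bonferroni_reject(p_values))  # [True, False, False]
```

The second statistic would pass an uncorrected 5% test, but it does not survive the corrected threshold of 0.05 / 3.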

During the second hour, we will present the conditional selective inference framework, a relatively new approach to double dipping that circumvents the need for multiple testing corrections. In this framework, we use all of our data to identify an interesting question, and then we answer that question, again using all of our data, but without re-using any of the (statistical) information that led us to identify the question.
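A very simple instance of conditioning on selection can be sketched as follows. The selection rule here is an assumption made for illustration (a z-statistic is only reported when its absolute value exceeds a threshold), and the function names are hypothetical; the settings treated in the course are likely more involved. The conditional p-value asks how extreme the statistic is among statistics that would have been selected at all.

```python
import math
import random


def two_sided_tail(z):
    """P(|Z| > |z|) for Z ~ N(0, 1)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def conditional_p_value(z, threshold):
    """P(|Z| >= |z| given |Z| > threshold) under the null Z ~ N(0, 1).
    Only defined for statistics that passed the selection rule."""
    if abs(z) <= threshold:
        raise ValueError("statistic was not selected")
    return two_sided_tail(z) / two_sided_tail(threshold)


def compare_rejection_rates(threshold=1.5, n_draws=40000, alpha=0.05, seed=0):
    """Among null statistics that pass selection, compare how often the
    naive and the conditional p-values fall at or below alpha."""
    rng = random.Random(seed)
    draws = (rng.gauss(0.0, 1.0) for _ in range(n_draws))
    selected = [z for z in draws if abs(z) > threshold]
    naive = sum(two_sided_tail(z) <= alpha for z in selected) / len(selected)
    conditional = sum(
        conditional_p_value(z, threshold) <= alpha for z in selected
    ) / len(selected)
    return naive, conditional


naive_rate, conditional_rate = compare_rejection_rates()
print(f"naive: {naive_rate:.2f}, conditional: {conditional_rate:.2f}")
```

Under these assumptions the naive rate among selected statistics is well above 5%, while the conditional rate stays close to the nominal level.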

Finally, during the third hour, we will consider approaches that involve splitting the data into a training set and a test set: the training set can be used to come up with an interesting question, and the test set can be used to answer it. The simplest such approach is sample splitting, which is a key tool in any data analyst’s toolbox.
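Sample splitting as described above can be sketched in a few lines. This is an illustrative simulation under assumed conditions (20 null features, independent observations with known unit variance, an even split), with hypothetical names throughout.

```python
import math
import random


def sample_split_rejection_rate(n_obs=40, n_features=20, n_reps=1000, seed=0):
    """Rejection rate of a 5% z-test when the feature is chosen on the
    training half and tested on the held-out half. All nulls are true."""
    rng = random.Random(seed)
    z_crit = 1.959963984540054  # two-sided 5% critical value of N(0, 1)
    half = n_obs // 2
    rejections = 0
    for _ in range(n_reps):
        # data[j] holds the n_obs observations of feature j (true mean 0).
        data = [
            [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
            for _ in range(n_features)
        ]
        # Training set: choose the feature with the largest absolute mean.
        train_means = [sum(col[:half]) / half for col in data]
        chosen = max(range(n_features), key=lambda j: abs(train_means[j]))
        # Test set: z-test for the chosen feature on held-out observations.
        test = data[chosen][half:]
        z = math.sqrt(len(test)) * sum(test) / len(test)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_reps


split_rate = sample_split_rejection_rate()
print(f"Rejection rate with sample splitting: {split_rate:.2f}")
```

Because the held-out observations played no part in choosing the feature, the test behaves like an ordinary pre-specified test, and the simulated rate should sit near 0.05.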

But there are many situations in which sample splitting is either unappealing or inapplicable: for instance, if the sample size is very small, or if the observations are not independent and identically distributed. In such settings, data thinning provides an attractive alternative. Data thinning enables us to “split” even a single datapoint into two independent pieces, so that we can identify an interesting question on one piece and answer it on the other.
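For a Poisson count, one well-known way to split a single observation is binomial thinning: given a count x, assign each of its x units to the first piece with probability eps, and the rest to the second piece. If x is Poisson with mean lam, the two pieces are independent Poisson counts with means eps * lam and (1 - eps) * lam. The sketch below is a minimal illustration of that fact; the parameter name `eps` and the helper functions are assumptions for the example, and other distributions require different constructions.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def thin_poisson(x, eps, rng):
    """Split a count x into (x1, x2) with x1 ~ Binomial(x, eps), x2 = x - x1."""
    x1 = sum(rng.random() < eps for _ in range(x))
    return x1, x - x1


rng = random.Random(0)
lam, eps, n_draws = 10.0, 0.5, 5000
pairs = [thin_poisson(sample_poisson(lam, rng), eps, rng) for _ in range(n_draws)]

mean1 = sum(a for a, _ in pairs) / n_draws
mean2 = sum(b for _, b in pairs) / n_draws
cov = sum(a * b for a, b in pairs) / n_draws - mean1 * mean2
print(f"mean of piece 1: {mean1:.2f}, mean of piece 2: {mean2:.2f}, cov: {cov:.2f}")
```

In this simulation both pieces should average close to 5 and their sample covariance should be close to zero, consistent with the two pieces being independent.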

Who should attend: This course is intended for data scientists and statisticians who conduct statistical inference (e.g., hypothesis tests and confidence intervals) in the “real world” and want to update their statistical toolset to enable valid analysis when the target of inference is selected from the data.

This Distinguished Speaker Series seminar will consist of three hours of lecture and Q&A, held live* via the free video-conferencing software Zoom.

*The video recording of the seminar will be made available to registrants within 24 hours and will be accessible for four weeks thereafter. That means that you can watch all of the class content and discussion even if you cannot participate synchronously.

Closed captioning is available for all live and recorded sessions. Captions can be translated to a variety of languages including Spanish, Korean, and Italian.