Workflow of Data Analysis

A 2-Day Seminar Taught by Bianca Manago, Ph.D.


Statistical analyses are only as good as the data that go into them. This is why most of the time on any data analysis project should be spent not on conducting the analyses (i.e., actually running the model) but on the steps needed to prepare the data for analysis. Dozens of decisions go into data management. If those decisions are not carefully considered and documented, they can produce erroneous results or preclude replication.

This seminar is designed to teach researchers how to prepare data for analysis in a way that is both accurate and replicable. If you follow these principles, your data analytic projects will be both well planned and well executed. The scope of the seminar ranges from broad topics, such as developing research plans, to detailed minutiae, such as planning variable names.

This seminar is for researchers who are trying to establish or improve their workflow. I do not expect participants to be expert programmers; the seminar should be accessible to novice R users while still being useful to more advanced users. The lessons balance ease of use with proper functioning and introduce researchers to useful tools, e.g., dual-pane browsers, macro programs, plain text editors, RStudio, and GitHub. For those who are already familiar with these tools, the seminar will teach you how to optimize them. Lessons from this seminar should make conducting research less painful, more efficient, more accurate, and more reproducible.

This is a hands-on seminar with ample opportunities to plan and practice your workflow.

Some highlights include:

  • Planning (analyses, sensitivity analyses, variable construction, etc.)
  • Directory structure
  • Data preservation
  • Documentation
  • Dual workflow (separating data management and analyses)
  • Writing robust script files
  • Using log files
  • Variable naming
  • Value labeling
  • Reproducibility and replication
  • Examining data

COMPUTING

The empirical examples and exercises in this course will emphasize R, but equivalent code and examples will be available for Stata. To fully benefit from the course, you should bring your own laptop loaded with R or Stata. Whichever package you choose, you should already have a working understanding of the software and be able to perform basic tasks in it.

If you’d like to take this course but are concerned that you don’t know enough R, there are excellent online resources for learning the basics. Here are our recommendations.


Who should attend?

This course is for anyone who wants to improve the efficiency and accuracy of their data analysis and presentation.


LOCATION, FORMAT AND MATERIALS

The class will meet from 9 am to 5 pm each day, with a 1-hour lunch break, at Temple University Center City, 1515 Market Street, Philadelphia, PA 19103.

Participants receive a bound manual containing detailed lecture notes (with equations and graphics), examples of computer printout, and many other useful features. This book frees participants from the distracting task of note-taking.


Registration and lodging

The fee of $995 includes all course materials. The early registration fee of $895 is available until March 24.

Refund Policy

If you cancel your registration at least two weeks before the course is scheduled to begin, you are entitled to a full refund (minus a processing fee of $50). 

Lodging Reservation Instructions

A block of guest rooms has been reserved at the Club Quarters Hotel, 1628 Chestnut Street, Philadelphia, PA at a special rate of $169 per night. The hotel is about a 5-minute walk from the seminar location. To make a reservation, call 203-905-2100 during business hours and identify yourself using group code STH423. For guaranteed rate and availability, you must reserve your room no later than Monday, March 23, 2020.

If you need to make a reservation after the cut-off date, you may call Club Quarters directly and ask for the “Statistical Horizons” rate (do not use the code or mention a room block), and they will try to accommodate your request.


Outline

PART 1: INTRODUCTION TO WORKFLOW
1. What is “workflow”?
2. Why care about WF?
3. WF and replication
4. Steps in and principles of WF

PART 2: PLAN, ORGANIZE, DOCUMENT, AND PRESERVE
1. Planning research projects in the:
     a. Large (overall questions, project checklist, and timeline)
     b. Middle (data cleaning, analyses, tables, and figures)
     c. Small (naming variables, naming files, value labels, and order of
         analyses/cleaning)
2. Organizing files and folders
3. Documentation
4. Preserving data and preventing loss
5. Replication

PART 3: SCRIPT FILES IN R
1. Strengths and weaknesses of R for workflow
2. Dual workflow
3. Robust script files
4. Legible script files
5. Automation in script files

PART 5: CLEANING, LABELING, & MISSING DATA
1. Naming and labeling variables
2. Missing data
3. Merging data
4. Verifying data

PART 6: ANALYZING & PRESENTING FINDINGS
1. Principles of data analysis
2. Documenting provenance
3. The posting principle
4. Presenting findings

PART 7: COLLABORATION
1. Key factors in collaboration
2. Introducing workflow with co-authors
3. Coordinating workflow with multiple authors