Reproducible Data Analysis: an Essential Capability in Modern Science
Description
The scientific method has always depended on other researchers being able to replicate and verify results. As scientific analysis becomes more complex and interdisciplinary, ensuring reproducibility becomes more challenging, especially in fields that combine different kinds of expertise. To promote transparency, consistency, and robustness in science, journals, funders, and institutions are encouraging the use of tools and practices that enhance reproducibility. Lifelong learning helps professionals keep up with fast-paced scientific developments and fosters creativity and innovation. By learning about version control, containers, pipelines, and data reproducibility, scientists at all levels can improve the reproducibility of their research, as well as the impact and reliability of their findings and methods. Moreover, with the methods introduced here they can collaborate and experiment in ways that allow reproducibility and creativity to coexist and thrive.
Detailed content
Session 1
- Successful examples of the use of these tools for reproducibility
- Aspects of reproducibility of data
Session 2
- Introduction to Git and GitHub concepts
- Routine usage of Git
- Inspect and compare different versions of a Git project
- Connect and integrate with GitHub
- Collaborate and experiment with Git and GitHub
Session 3
- Access VSC facilities and use the HPC scheduling system
Session 4
- Introduction to basic container concepts and Docker syntax
- Find, obtain, and run a Docker image
- Adapt and build Docker recipes
- Find and run Apptainer images
- Adapt and build Apptainer images
- Use Apptainer in the VSC with the scheduling system
Session 5
- Introduction to Nextflow concepts and syntax
- Execute Nextflow pipelines with different executors and environments
- Write and run a Nextflow pipeline
- Write and modify modules and config files as best practice for pipeline development
- Use Nextflow in the VSC with the scheduling system
Session 6
- Projects:
* 2 small projects
o The Git & GitHub project consists of collaboratively creating your documentation under version control. The project is started during the lesson and finished asynchronously before delivery (estimated asynchronous time: 3h).
o The Docker and Apptainer project consists of adapting, writing, and building one Docker image based on a Docker recipe. The project must be delivered on GitHub with its version history available. It is to be developed collaboratively after the lesson (estimated asynchronous time: 6h).
* 1 medium project
o The Nextflow project consists of using Docker or Apptainer images to create and run a Nextflow pipeline that uses config files and modules. The project must be delivered on GitHub with its version history available.
o In addition, an oral presentation (defence) of the final project must include a summary of the topics learned and examples that demonstrate the use of the tools, with a focus on reproducibility.
o The project is to be developed collaboratively after the lesson (estimated asynchronous time: 8h).
Course prerequisites
* Ability to use simple shell commands (on Linux, for example); you can use this e-learning material to prepare.
* Experience with scripting is preferred (point to resource of catch-up before the course)
* Having a VSC account
Final competences
- To use Git and GitHub for version control, making code, text documents, and other appropriate data reproducible and easy to share;
- To use Git and GitHub for individual projects and for collaboration;
- To use, adapt, and write containers locally and on a supercomputer, understanding the different systems and their particularities;
- To use and differentiate Docker containers and Apptainer images and their particular usage;
- To run, adapt, and write Nextflow pipelines locally and on a supercomputer;
- To understand and apply best practices for Nextflow pipelines, such as config files and modules.
Exam
Project evaluation and an oral evaluation.
Type of course
This is an on-campus course.
Course material
Syllabus, overheads, exercise handouts
Location
Technology park, 75 – CMB building (FSVM II), 9052 Ghent
Day 1: room L5, 5th floor
All other days: room L4, 4th floor
Register here
Dates
February 28, from 9.30 am to 5 pm
March 10 & 11, from 9.30 am to 5 pm
March 28, from 9.30 am to 1.30 pm
March 31, April 1, 28 & 29, from 9.30 am to 5 pm
May 12, from 10 am to 5 pm
Price
- €650
More info & subscription
UGent website: Universiteit Gent: Micro-credential Reproducible data analysis
Register via the tab: Off to a good start
Contact: science-academy@ugent.be