IMPRS-gBGC course 'Applied statistics & data analysis' 2020, Advanced
Category: Skill course
Credit points: 0.2 per course day
1. Advanced statistics
1.1 Organizational issues
Date: November 16 - 20, 2020
Place: lecture room @ MPI-BGC (depending on COVID-19 regulations)
Planned sessions:
- 09:00 - 09:45 lecture
- 09:45 - 10:00 break
- 10:00 - 11:00 talks
- 11:00 - 11:15 break
- 11:15 - 12:00 excursion
- 12:00 - 13:00 lunch
- 13:00 - 14:00 talks
- 14:00 - 14:15 break
- 14:15 - 15:00 lecture
- 15:00 - 17:00 practical part
Instructor:
1.2 Aims and scope
The course will cover selected topics of advanced statistics and machine learning. Lectures on some topics will be accompanied by presentations by participants, “Excursion” talks on applications in research, and basic practicals in the afternoon. The course requires basic knowledge of statistics. The practical sessions require basic knowledge of a programming language; examples will be provided in R.
1.3 Presentations by participants (mandatory for assignment)
Participants will give a presentation (20min + 10min Q&A) on a paper or topic of their choice. Below you can find a list of suggested papers. If you want to work on a topic in a team of 2 (i.e. 40min + 20min Q&A) or suggest an alternative topic, please send your proposed topic to mjung@bgc-jena.mpg.de by 31st October.
During registration, please choose a topic that has not yet been chosen.
All presentations need to be ready on Monday 16th Nov 2020 at 9 am. The detailed schedule will be announced then.
The presentations should be educational and focus on what one should know about a method when applying it, i.e. the principle, advantages, disadvantages, assumptions, and pitfalls, rather than all mathematical details, derivations, theorems and proofs. Practical examples are often very illustrative.
1.4 Other Preparations
Bring a laptop with a recent version of R installed and running for the practicals. If you prefer another language, that is fine, but we will not provide corresponding code examples. Please also make sure that you can access the internet via WLAN (BGC-users, if you have a BGC-account; BGC-guests, if you don't have an account).
1.5 Preliminary agenda
Day / time | Topic | Who |
---|---|---|
Monday, November 16 | | |
09:00 - 09:45 | Introduction to basic statistical tools | Martin Jung |
09:45 - 10:00 | Break | |
10:00 - 10:30 | Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy | |
10:30 - 11:00 | Toward the true near‐surface wind speed: Error modeling and calibration using triple collocation | |
11:00 - 11:15 | Break | |
11:15 - 12:00 | Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method | |
12:00 - 13:00 | Lunch Break | |
13:00 - 13:30 | Archetypal Analysis | |
13:30 - 14:00 | Visualizing Data using t-SNE | |
14:00 - 14:15 | Break | |
14:15 - 15:00 | Dimensionality reduction | Mirco Migliavacca |
15:00 - 17:00 | Practical | Mirco Migliavacca |
Tuesday, November 17 | | |
09:00 - 09:45 | Time series analysis | Lina Estupinan-Suarez |
09:45 - 10:00 | Break | |
10:00 - 10:30 | Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure | |
10:30 - 11:00 | Summarizing multiple aspects of model performance in a single diagram | |
11:00 - 11:15 | Break | |
11:15 - 12:00 | EXCURSION | Nora Linscheid |
12:00 - 13:00 | Lunch Break | |
13:00 - 13:30 | BGI SEMINAR | |
13:30 - 14:00 | BGI SEMINAR | |
14:00 - 14:15 | Break | |
14:15 - 15:00 | Mixed effect model | Thomas Wutzler |
15:00 - 17:00 | Practical | Thomas Wutzler |
Wednesday, November 18 | | |
09:00 - 09:45 | Random Forests | Martin Jung |
09:45 - 10:00 | Break | |
10:00 - 10:30 | Bias in random forest variable importance measures: Illustrations, sources and a solution | |
10:30 - 11:00 | Isolation Forest | |
11:00 - 11:15 | Break | |
11:15 - 12:00 | EXCURSION | Jacob Nelson |
12:00 - 13:00 | Lunch Break | |
13:00 - 13:30 | A working guide to boosted regression trees | |
13:30 - 14:00 | A unified approach to interpreting model predictions | |
14:00 - 14:15 | Break | |
14:15 - 15:00 | Model evaluation | Martin Jung |
15:00 - 17:00 | Practical | Simon Bessnard |
Thursday, November 19 | | |
09:00 - 09:45 | Neural Networks | Basil Kraft |
09:45 - 10:00 | Break | |
10:00 - 10:30 | Deep learning | |
10:30 - 11:00 | Long Short-Term Memory | |
11:00 - 11:15 | Break | |
11:15 - 12:00 | EXCURSION | Basil Kraft |
12:00 - 13:00 | Lunch Break | |
13:00 - 13:30 | Variable Importance | Martin Jung |
13:30 - 15:00 | Practical | Fabian Ganz |
Friday, November 20 | | |
09:00 - 09:45 | Parameter estimation | Nuno Carvalhais |
09:45 - 10:00 | Break | |
10:00 - 10:30 | Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling | |
10:30 - 11:00 | A comparison of techniques for the estimation of model prediction uncertainty | |
11:00 - 11:15 | Break | |
11:15 - 12:00 | EXCURSION | Tina Trautmann |
12:00 - 13:00 | Lunch Break | |
13:00 - 13:30 | Deep learning and process understanding for data-driven Earth system science | |
13:30 - 14:00 | Feedback | |
1.6 Interested?
Prerequisites:
- Basic knowledge of a language of scientific computing: R, Matlab
- Make use of the R course - The basics
- Either the course 'Basic statistics' or a refresher of the typical “Statistics 1” type of university lectures.
Exercises will be in R. The use of any other language is welcome; however, support depends on the person in charge and cannot be guaranteed.
Learn R… Here is a list of useful online resources to help you bring your R skills to a new level.
The material from the R basics course might also be useful for you.
1.7 Material
Here you can download the papers that you will need for your presentation.
1.8 Requirements for the assignment
All participants have to prepare a short presentation on one "unconventional" method of their choice. Each day will include a few of these presentations, and we want to discuss the pros and cons of each method with you. Please register for one of the following topics (or feel free to propose another one).
Important
- Don’t choose a technique that you know already!
- Check the list of participants below and choose a topic that has not yet been selected. Ideally, we would like to cover all topics.
... and note that we are not necessarily experts in these methods.
# / NAME OF PRESENTER | Topic | Context |
---|---|---|
1 / SOPHIA WALTER | Archetypal Analysis | Multivariate data representation |
2 / ANN-SOPHIE LEHNERT | A working guide to boosted regression trees | non-parametric regression |
3 / | From outliers to prototypes: Ordering data | novelty/outlier detection |
4 / SANTIAGO BOTIA | Long Short-Term Memory | neural networks for time series |
5 / | Calibration of process-oriented models | model calibration and evaluation |
6 / SOPHIE VON FROMM | Deep learning | deep learning overview |
7 / CAGLAR KUCUK | A unified approach to interpreting model predictions | variable importance, explainable AI |
8 / | Quantile regression forests | random forest, quantile regression |
9 / | Deep learning and process understanding for data-driven Earth system science | deep learning and hybrid modeling for Earth System Science |
10 / SINIKKA PAULUS | Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure | model evaluation |
11 / | MissForest—non-parametric missing value imputation for mixed-type data | random forests, data imputation (filling missing data) |
12 / WEIJIE ZHANG | Bias in random forest variable importance measures: Illustrations, sources and a solution | random forest, variable importance |
13 / | Measuring and Testing Dependence by Correlation of Distances | non-linear correlation |
14 / ULISSE GOMARASCA | Visualizing Data using t-SNE | dimensionality reduction, multivariate data visualization |
15 / | The energy of data | non-parametric statistics based on distances |
16 / QIAN ZHANG | Summarizing multiple aspects of model performance in a single diagram | model evaluation |
17 / WANTONG LI | Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling | model evaluation and calibration |
18 / HOONTAEK LEE | Isolation Forest | random forest, novelty/outlier detection |
19 / YUNPENG LUO | Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy | uncertainty |
20 / SIYUAN WANG | A comparison of techniques for the estimation of model prediction uncertainty | uncertainty |
21 / | Verification, validation, and confirmation of numerical models in the earth sciences | model evaluation and calibration |
22 / ALBRECHT SCHALL | Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method | clustering |
23 / | Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting | smoothing |
24 / JASPER DENISSEN | Toward the true near‐surface wind speed: Error modeling and calibration using triple collocation | uncertainty |
2. Feedback
Your feedback is valuable because it helps the instructors and organizers to improve the individual modules and the general structure of the workshop.
The survey results are available here. Statistics and statements should not be taken as an exhaustive or exclusive list.