Reproducible code #61

eshom · 2021-07-30T18:55:35Z

I think that, as a standard, all scripts should be completely independent and reproducible, i.e. people should be able to copy and paste the code into their R REPL session and run it without errors. This is currently not the case for many scripts in this repo. Instead of supplying example data, many algorithms are written as "templates" where users have to plug in their own data. However, there's no information on what the data structure should be.

R has many built-in datasets, so these can be used to run the algorithms. If a script is just a function definition, it should also include an example usage of the function.
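As a rough illustration of that idea, here is a minimal sketch of a script that defines a function and then runs it on a built-in dataset. The function name `min_max_scale` is hypothetical and not taken from the repo; `mtcars` ships with base R.

```r
# Hypothetical helper (not from the repo): scale a numeric vector to [0, 1].
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Example usage on the built-in mtcars dataset, so the script runs as pasted
scaled_mpg <- min_max_scale(mtcars$mpg)
summary(scaled_mpg)
```

Because the data comes with R, the whole script can be pasted into a fresh session without any setup.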

I could list here all scripts that need to be written this way.

What do you think?

eshom · 2021-07-30T19:50:16Z

./Data-Preprocessing/lasso.R
./Data-Preprocessing/K_Folds.R
./Data-Preprocessing/data_processing.R
./Data-Preprocessing/dimensionality_reduction_algorithms.R
./Classification-Algorithms/lasso.R
./Classification-Algorithms/decision_tree.R
./Classification-Algorithms/KNN.R
./Classification-Algorithms/gradient_boosting_algorithms.R
./Classification-Algorithms/LightGBM.R
./Classification-Algorithms/SVM.R
./Classification-Algorithms/xgboost.R
./Classification-Algorithms/naive_bayes.R
./Classification-Algorithms/random_forest.R
./Clustering-Algorithms/K-Means.R
./Clustering-Algorithms/dbscan_clustering.R
./Clustering-Algorithms/gmm.R
./Clustering-Algorithms/pam.R
./Clustering-Algorithms/kmeans_raw_R.R
./Association-Algorithms/apriori.R
./Regression-Algorithms/logistic_regression2.R
./Regression-Algorithms/logistic_regression.R
./Regression-Algorithms/linear_regression.R
./Regression-Algorithms/KNN.R
./Regression-Algorithms/gradient_boosting_algorithms.R
./Regression-Algorithms/LightGBM.R
./Regression-Algorithms/ANN.R
./Regression-Algorithms/multiple_linear_regression.R
./Regression-Algorithms/linearRegressionRawR.R
./Data-Manipulation/OneHotEncode.R
./Data-Manipulation/LabelEncode.R

siriak · 2021-08-01T11:00:43Z

How can the scripts be tested if they don't accept data as arguments? I think we need to add unit tests instead. They will test our code and provide users with examples at the same time.
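A minimal sketch of what such a test could look like, assuming the algorithm is a function that takes its data as an argument. The function `label_encode` is a hypothetical stand-in, and base R's `stopifnot()` is used here only to keep the example dependency-free; a testing package could be used instead.

```r
# Hypothetical algorithm that accepts its data as an argument
label_encode <- function(x) {
  as.integer(factor(x))
}

# Tests that double as usage examples
stopifnot(identical(label_encode(c("b", "a", "b")), c(2L, 1L, 2L)))
stopifnot(identical(label_encode(iris$Species[c(1, 51, 101)]), 1:3))
```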

eshom · 2021-08-01T11:54:23Z

I think this can be part of the documentation solution we talked about in #59. Using knitr, we can turn scripts into HTML reports, which would nicely incorporate example output. Errors caused by bad scripts can be handled, printed, and reviewed. I can write the R code for this, but I'm not sure how to set up GitHub Actions correctly.

siriak · 2021-08-01T12:55:45Z

So you suggest having algorithms separated from data and unit tests that will show usage of the algorithms? And the tests can be transformed into HTML reports for convenience? Sounds good to me

eshom · 2021-08-01T14:02:18Z

Hmm, not exactly. What I mean is that specially formatted scripts can be turned into HTML reports (https://rdrr.io/cran/knitr/man/spin.html). Data would still need to be part of the algorithms. Because this function runs the actual script while compiling the report, any problem with the script would surface as an error, and that error can be part of a test. At the same time, good scripts would compile to nice HTML reports.
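To give an idea of the script format, here is a small sketch, assuming knitr is installed. The script content is made up for illustration; lines starting with `#'` become prose and everything else stays R code. Passing `knit = FALSE` only converts the script to R Markdown text without running it.

```r
# A small script in the format knitr::spin() understands:
# lines starting with #' become prose, everything else is R code.
script_lines <- c(
  "#' # Linear regression on a built-in dataset",
  "#' The `cars` dataset ships with R, so no external data is needed.",
  "fit <- lm(dist ~ speed, data = cars)",
  "summary(cars)"
)

if (requireNamespace("knitr", quietly = TRUE)) {
  # knit = FALSE: convert to R Markdown text only, do not execute the code
  rmd <- knitr::spin(text = script_lines, knit = FALSE)
  cat(rmd, sep = "\n")
} else {
  rmd <- NULL
  message("knitr is not installed; skipping the conversion")
}
```

Calling `knitr::spin()` on a script file with its default `knit = TRUE` should additionally run the code and build the report, which is the step where a broken script would show up.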

It would make more sense once we have a prototype running in https://github.com/Panquesito7/R/tree/documentation_stuff

alexgarland · 2021-08-01T16:51:01Z

I agree with you on this fundamental issue; for linearRegressionRaw.R, I replaced a reference to the diamonds dataset with a specifically simulated and reproducible (via a set seed) synthetic dataset.
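A minimal sketch of that kind of seeded synthetic data, for illustration only. The seed, sample size, and coefficients below are assumptions and are not the values used in the actual script.

```r
# Fix the seed so everyone generates exactly the same data
set.seed(42)

n <- 100
x <- runif(n, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = 1)  # true intercept 2, true slope 3
synthetic <- data.frame(x = x, y = y)

# Fit a linear model; the estimates should land near the true values
fit <- lm(y ~ x, data = synthetic)
coef(fit)
```

Since the true coefficients are known, the fitted values also give a quick sanity check that the algorithm behaves as expected.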

Half of the challenge here is going to be eliminating extraneous library calls, such as those that pull in tidyverse functions and datasets.

eshom · 2021-08-01T18:25:18Z

I personally don't mind if third-party packages are used, but either the include.only argument of library() should be used, so that only the objects that actually appear in the code are attached to the search path, or preferably library() calls should be replaced entirely with the double colon operator (::) to make everything more explicit.
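The two options could look roughly like this. The sketch uses the `tools` package only because it ships with R; the file path is just an example string. Note that `include.only` requires R 3.6.0 or later.

```r
# Option 1: attach only the names the script actually uses
library(tools, include.only = "file_ext")
ext <- file_ext("Data-Preprocessing/lasso.R")

# Option 2: attach nothing and spell out the namespace with ::
stem <- tools::file_path_sans_ext(basename("Data-Preprocessing/lasso.R"))

ext
stem
```

With the second option, a reader can tell where every function comes from without scrolling up to the library calls.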

In either case, there should be a check that the required packages are installed. Something like:

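# Install ggplot2 only if it is missing, then attach it
if (!requireNamespace("ggplot2", quietly = TRUE))
    install.packages("ggplot2")
library(ggplot2)
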
# The rest of the code
# ...

eshom added the enhancement New feature or request label Jul 30, 2021

Panquesito7 pinned this issue Jul 30, 2021

siriak mentioned this issue Jan 21, 2022

Where are the sample datasets? #39

Closed

TheAlgorithms deleted a comment Apr 18, 2022
