Data has been gathered from different sources listed below:
The purpose of this project is educational. It is the first of two projects to be completed in the Data Bootcamp at 🍊 Core Code School.
The requirement for the project is to build a data app with a backend built with Flask, a frontend built with Streamlit, and a database (Postgres or MongoDB).
- API - Python, Flask, PyMongo
- Data - MongoDB, Python, Jupyter Notebook, Pandas, mongoshell script
- Streamlit - Python, Pandas, Streamlit API, Streamlit State API
- Other tools - Commitizen, GitHub Projects, GitHub Actions, Okteto
The project has 3 main services: Database / API (backend) / Streamlit (frontend). Let's describe these services:
The data service is a custom MongoDB image in which the data is loaded into the database during the init phase.
The original csv `source/penguins_lter.csv` is transformed into `database/docker-entrypoint-initdb.d/seed.json` by running `generate-seed-data.py`.
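The conversion step can be sketched with the standard library alone. This is only an illustration: `csv_to_seed` is a hypothetical helper, not the actual contents of `generate-seed-data.py`, and the column names in the sample are assumptions.

```python
import csv
import io
import json


def csv_to_seed(csv_text: str) -> str:
    """Convert CSV text into a JSON array with one object per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Empty CSV cells become None so they are stored as null in the seed.
    cleaned = [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in rows
    ]
    return json.dumps(cleaned, indent=2)


# Tiny in-memory sample; a real script would read source/penguins_lter.csv.
sample = "Species,Island,Body Mass (g)\nAdelie,Torgersen,3750\nGentoo,Biscoe,\n"
seed = csv_to_seed(sample)
```

All values stay as strings here; casting numeric columns would be a separate step.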
Once the seed for the database is available, building the mongo image runs the `mongo-init.js` mongoshell script, which creates the admin and api users and creates the database with the following collections:
- `kaggle-raw-data` - the seed.json itself
- `ng-species-raw-data` - the species.json collection from web-scraping NG
- `individuals` - collection with the measurements of each penguin; each document has pointers to `islands`, `regions`, `species` and `studynames`
- `islands` - collection with the data regarding the island
- `regions` - collection with the data regarding the region
- `species` - collection with the data regarding the species
- `studynames` - collection with the data regarding the study name
These collections are extracted from `kaggle-raw-data` so that extra data can be added to each of them without changing `individuals`, which is the main collection.
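The idea of pulling lookup collections out of the raw rows and leaving pointers behind can be sketched as follows. The field names, the `<field>_id` pointer naming and the integer ids are assumptions made for illustration; the actual seed logic may differ (for example, by using MongoDB ObjectIds).

```python
def split_collections(raw_rows, fields):
    """Extract one lookup collection per field and point each row at it."""
    lookups = {field: {} for field in fields}  # field -> {value: id}
    individuals = []
    for row in raw_rows:
        doc = dict(row)
        for field in fields:
            value = doc.pop(field)
            ids = lookups[field]
            # Reuse the id if the value was seen before, else assign the next one.
            doc[f"{field}_id"] = ids.setdefault(value, len(ids) + 1)
        individuals.append(doc)
    collections = {
        field: [{"_id": _id, "name": name} for name, _id in ids.items()]
        for field, ids in lookups.items()
    }
    return individuals, collections


raw = [
    {"species": "Adelie", "island": "Torgersen", "body_mass_g": 3750},
    {"species": "Adelie", "island": "Biscoe", "body_mass_g": 3800},
]
individuals, collections = split_collections(raw, ["species", "island"])
```

With this layout, a new attribute for an island is added to its lookup document once, and the `individuals` documents stay untouched.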
The API is the backend service for the Streamlit frontend and the one that communicates with the database. The sub-repo for the API is structured as follows:
- `main.py` - entry point for the flask server.
- `config.py` - env variables all in the same place.
- `routes` - dir with all the routes, entry point to the API.
- `controllers` - dir with the controllers for each route, responsible for executing the code for that route.
- `libs` - utils used throughout the project for different purposes.
- `decorators` - custom decorator methods.
The API exposes the following routes:
- `GET - /<collection>` - returns all the documents found for this collection in the database.
- `PATCH - /<collection>/<id>` - modifies the document `<id>` of the `<collection>`. The payload should be compliant with the collection's fields.
The custom decorators are:
- `handle_error` - for each route, this decorator catches errors and returns a json error response.
- `validate_route` - as the root route is based on parameters, this decorator checks that the collection exists in the db; if not, it throws an error before accessing the controller.
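How the two decorators might cooperate can be sketched without any framework. This is not the project's code: the real decorators wrap Flask routes and check the database, whereas here `KNOWN_COLLECTIONS` is a hard-coded stand-in, `get_collection` is a hypothetical controller, and the 404/500 status codes are assumptions.

```python
import functools

# Stand-in for asking the database which collections exist.
KNOWN_COLLECTIONS = {"individuals", "islands", "regions", "species", "studynames"}


def handle_error(func):
    """Catch exceptions raised by the route and return an error payload."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except LookupError as error:
            return {"error": str(error)}, 404
        except Exception as error:
            return {"error": str(error)}, 500
    return wrapper


def validate_route(func):
    """Reject unknown collections before the controller runs."""
    @functools.wraps(func)
    def wrapper(collection, *args, **kwargs):
        if collection not in KNOWN_COLLECTIONS:
            raise LookupError(f"Collection '{collection}' does not exist")
        return func(collection, *args, **kwargs)
    return wrapper


@handle_error
@validate_route
def get_collection(collection):
    # Stand-in controller; the real one would query MongoDB.
    return {"collection": collection, "documents": []}, 200
```

The order matters: `handle_error` is outermost so that the error raised by `validate_route` is also converted into a json-style response.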
The libs are:
- `mongo_client` - setup for the mongodb connection using `flask_pymongo`.
- `response` - utils to return different responses.
The Streamlit service is where the data is displayed. Its sub-repo is structured as follows:
- `main.py` - entry point for the streamlit app
- `utils` - dir with methods used throughout the project
- `pages` - dir with the pages available in the streamlit app
- `components` - dir with the components used throughout the project
- `api` - dir with the methods used to call the backend to retrieve data
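As an illustration of what a helper in the `api` dir could look like, the sketch below builds requests for the `GET /<collection>` and `PATCH /<collection>/<id>` routes using only the standard library. The function names, the base URL and the sample payload are hypothetical; the project may well use a different HTTP client.

```python
import json
import urllib.request

API_BASE = "http://localhost:5000"  # assumption: derived from API_URL and API_PORT


def build_get(collection):
    """Request for all the documents of a collection."""
    return urllib.request.Request(f"{API_BASE}/{collection}", method="GET")


def build_patch(collection, doc_id, payload):
    """Request that modifies one document with a JSON payload."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{collection}/{doc_id}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def send(request):
    """Send the request and decode the JSON body (requires a running API)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A page could then call, for example, `send(build_get("species"))` once the API container is up.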
You can clone the repo and run `docker-compose up`.
Env variables needed to run the project:
- `MONGO_URI` - URI for the MongoDB database (incl. db-name).
- `MONGO_DBNAME` - database name where all data will be stored.
- `MONGO_ADMIN_USERNAME` - username for the database admin user.
- `MONGO_ADMIN_PASSWORD` - password for the database admin user.
- `MONGO_API_USERNAME` - username for the database user used in the api.
- `MONGO_API_PASSWORD` - password for the database user used in the api.
- `FLASK_DEBUG` - flag to run Flask in debug mode, `False` or `True`.
- `FLASK_ENV` - environment where Flask is running, `development`.
- `API_URL` - url for the API.
- `API_PORT` - the port where the API will be available.
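A `config.py`-style loader for some of these variables might look like the sketch below. Which variables are treated as required, the defaults, and the example values are all assumptions made for illustration, and `load_config` is a hypothetical name.

```python
import os


def load_config(env=os.environ):
    """Collect the settings in one place, failing early on missing values."""
    required = ["MONGO_URI", "MONGO_DBNAME", "API_URL", "API_PORT"]
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"Missing env variables: {', '.join(missing)}")
    return {
        "MONGO_URI": env["MONGO_URI"],
        "MONGO_DBNAME": env["MONGO_DBNAME"],
        "API_URL": env["API_URL"],
        "API_PORT": int(env["API_PORT"]),
        # Only the literal string "True" enables debug mode.
        "FLASK_DEBUG": env.get("FLASK_DEBUG", "False") == "True",
        "FLASK_ENV": env.get("FLASK_ENV", "development"),
    }


# Example values only; real credentials belong in the environment, not in code.
example_env = {
    "MONGO_URI": "mongodb://localhost:27017/penguins",
    "MONGO_DBNAME": "penguins",
    "API_URL": "http://localhost",
    "API_PORT": "5000",
    "FLASK_DEBUG": "True",
}
config = load_config(example_env)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.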
Some features have not been included in this first version, so here is some WIP and future work to be done on this repo:
- Production pipeline for API and Streamlit
- Refactor MongoDB seed
- Add Auth to Flask API
- Enable PDF download of visualizations
- Add more visualizations
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.