CS F469 IR Assignment - 2
Problem Statement:
We have to implement Locality Sensitive Hashing (LSH) to find duplicate or similar DNA sequences within the corpus. The steps involved are shingling, minhashing, and locality sensitive hashing. The main idea is to hash similar documents into the same buckets, so that documents sharing a bucket have a high probability of being similar or duplicates.
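The three steps above can be sketched as follows. This is only a minimal illustration, not the code in LSH_program.py: the function names, the shingle length k, the number of hash functions, the band/row split, and the toy sequences are all assumptions chosen for the example.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

# Assumed modulus for the hash family: the Mersenne prime 2**61 - 1.
PRIME = (1 << 61) - 1


def shingles(seq, k=5):
    """Shingling: return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """Minhashing: for each hash function keep the minimum value over all shingles.

    Assumes shingle_set is non-empty (the sequence has at least k characters).
    """
    # crc32 gives a stable integer id per shingle (unlike the built-in hash()).
    ids = [zlib.crc32(s.encode()) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def lsh_buckets(signatures, bands, rows):
    """LSH: split each signature into bands and bucket documents by band content."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        if bands * rows != len(sig):
            raise ValueError("bands * rows must equal the signature length")
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Documents that share at least one bucket are candidate similar pairs."""
    pairs = set()
    for ids in buckets.values():
        for x, y in combinations(sorted(ids), 2):
            pairs.add((x, y))
    return pairs


# Toy sequences made up for illustration; they are not taken from the dataset.
corpus = {
    "a": "ATGCGTACGTTAGC",
    "b": "ATGCGTACGTTAGC",
    "c": "TTTTTTTTTTTTTT",
}
params = make_hash_params(num_hashes=20)
signatures = {name: minhash_signature(shingles(seq), params)
              for name, seq in corpus.items()}
print(candidate_pairs(lsh_buckets(signatures, bands=5, rows=4)))
```

Since "a" and "b" are identical, their signatures agree in every band, so they always end up as a candidate pair. Using more bands with fewer rows each makes it easier for loosely similar sequences to share a bucket, while fewer bands with more rows makes the match stricter.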
About the project
Dataset used - Kaggle-human-data
Have a look at the file Design Architecture. It includes the concepts used along with the time taken for each implementation step.
Project By:
- Kriti Jethlia: Email- [email protected]
- Jui Pradhan: Email- [email protected]
- Anusha Agarwal: Email- [email protected]
Steps to run:
- Clone the repository: https://github.com/KritiJethlia/LSH.git
- cd LSH
- Run the file: python3 LSH_program.py
- Type your query in the terminal and wait until it returns the similar DNA sequence results :)
Python modules used:
- time
- collections
- pandas
- pickle
- numpy
- random
- operator
- sys
- copy