Revision 1187e871
Added by Petr Hlaváč about 4 years ago
--- python-module/Utilities/Crawler/BasicCrawlerFunctions.py
+++ python-module/Utilities/Crawler/BasicCrawlerFunctions.py
@@ -5,6 +5,8 @@
 
 # Path to crawler logs
 CRAWLER_LOGS_PATH = "CrawlerLogs/"
+# Path to crawled data
+CRAWLED_DATA_PATH = "CrawledData/"
 
 
 def get_all_links(url):
@@ -98,10 +100,11 @@
     url_parts = url.split("/")
     file_name = url_parts[len(url_parts)-1]
 
-    path = CRAWLER_LOGS_PATH + dataset_name + '/'
+    log_path = CRAWLER_LOGS_PATH + dataset_name + '/'
+    data_path = CRAWLED_DATA_PATH + dataset_name + '/'
 
     # download file chunk by chunk so we can download large files
-    with open(path + file_name, "wb") as file:
+    with open(data_path + file_name, "wb") as file:
         for chunk in r.iter_content(chunk_size=1024):
 
             # writing one chunk at a time to file
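For context, the change above separates the log directory from the data directory and streams the downloaded file into the latter. A minimal sketch of that chunked-download pattern, assuming the `requests` library and the two path constants from the diff (the `download_file` and `file_name_from_url` helpers are hypothetical names, not functions from the revision):

```python
import os
import requests

# Constants mirroring the ones in the diff
CRAWLER_LOGS_PATH = "CrawlerLogs/"
CRAWLED_DATA_PATH = "CrawledData/"


def file_name_from_url(url):
    """Take the last path segment of the URL as the file name."""
    url_parts = url.split("/")
    return url_parts[len(url_parts) - 1]


def download_file(url, dataset_name):
    """Stream `url` into CrawledData/<dataset_name>/ chunk by chunk."""
    file_name = file_name_from_url(url)
    data_path = CRAWLED_DATA_PATH + dataset_name + '/'
    os.makedirs(data_path, exist_ok=True)

    # stream=True keeps the response body out of memory, so large
    # files can be written one 1 KiB chunk at a time
    r = requests.get(url, stream=True)
    with open(data_path + file_name, "wb") as file:
        for chunk in r.iter_content(chunk_size=1024):
            # writing one chunk at a time to file
            file.write(chunk)
    return data_path + file_name
```

Writing to `data_path` rather than the old shared `path` keeps crawler logs and crawled payloads in separate directory trees, which is the point of the revision.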
Re #7966
- Created helper scripts for dataset management