Revision af7609b5

Added by Tomáš Ballák over 3 years ago

Re #8193 - refactoring crawler

View differences:

modules/crawler/DatasetCrawler/WIFI_crawler.py
@@ -1,11 +1,12 @@
 from Utilities import folder_processor
 from Utilities.Crawler import basic_crawler_functions
+from shared_types import ConfigType

 # Path to crawled data
 CRAWLED_DATA_PATH = "CrawledData/"


-def crawl(config):
+def crawl(config: ConfigType):
     """
     Implement crawl method that downloads new data to path_for_files
     For keeping the project structure
@@ -21,20 +22,25 @@
     path_for_files = CRAWLED_DATA_PATH + dataset_name + '/'

     first_level_links = basic_crawler_functions.get_all_links(url)
-    filtered_first_level_links = basic_crawler_functions.filter_links(first_level_links, "^OD_ZCU")
-    absolute_first_level_links = basic_crawler_functions.create_absolute_links(filtered_first_level_links, url)
+    filtered_first_level_links = basic_crawler_functions.filter_links(
+        first_level_links, "^OD_ZCU")
+    absolute_first_level_links = basic_crawler_functions.create_absolute_links(
+        filtered_first_level_links, url)

     files = []

     for link in absolute_first_level_links:
         second_level_links = basic_crawler_functions.get_all_links(link)
-        filtered_second_level_links = basic_crawler_functions.filter_links(second_level_links, regex)
-        absolute_second_level_links = basic_crawler_functions.create_absolute_links(filtered_second_level_links, link)
+        filtered_second_level_links = basic_crawler_functions.filter_links(
+            second_level_links, regex)
+        absolute_second_level_links = basic_crawler_functions.create_absolute_links(
+            filtered_second_level_links, link)

         for file_link in absolute_second_level_links:
             files.append(file_link)

-    files = basic_crawler_functions.remove_downloaded_links(files, dataset_name)
+    files = basic_crawler_functions.remove_downloaded_links(
+        files, dataset_name)

     for file in files:
         basic_crawler_functions.download_file_from_url(file, dataset_name)
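
Reviewer note: this diff shows neither shared_types.ConfigType nor the helpers in basic_crawler_functions, so the sketch below only illustrates how the annotated config and the filter-then-absolutize step might fit together. Everything in it is an assumption. The ConfigType keys are guessed from the variable names used in crawl() (dataset_name, url, regex). filter_links and create_absolute_links are stand-ins built on re and urllib.parse, not the project's implementations. select_links and the example.org URL are hypothetical.

```python
import re
from typing import List, TypedDict
from urllib.parse import urljoin


class ConfigType(TypedDict):
    # Hypothetical keys; the real shared_types.ConfigType is not part of this diff.
    dataset_name: str
    url: str
    regex: str


def filter_links(links: List[str], pattern: str) -> List[str]:
    """Stand-in: keep only the links in which the regex pattern is found."""
    compiled = re.compile(pattern)
    return [link for link in links if compiled.search(link)]


def create_absolute_links(links: List[str], base_url: str) -> List[str]:
    """Stand-in: resolve each relative link against the page it was found on."""
    return [urljoin(base_url, link) for link in links]


def select_links(links: List[str], pattern: str, base_url: str) -> List[str]:
    """Hypothetical helper: one crawl level, i.e. filter first, then absolutize."""
    return create_absolute_links(filter_links(links, pattern), base_url)


config: ConfigType = {
    "dataset_name": "WIFI",
    "url": "https://example.org/data/",  # placeholder URL, not the real source
    "regex": r"\.zip$",
}

first_level = select_links(
    ["OD_ZCU_WIFI_2020/", "index.html"], "^OD_ZCU", config["url"])
second_level = select_links(
    ["a.zip", "readme.txt"], config["regex"], first_level[0])
```

With a TypedDict like this, a type checker such as mypy can flag a misspelled or missing config key in crawl(), which is presumably the point of the new annotation. At runtime the value is still a plain dict.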
