DatasetCrawler » History » Version 6
Petr Hlaváč, 2020-05-27 08:48
h1. DatasetCrawler

This folder contains the crawler implementations for the individual datasets. Crawlers are imported dynamically, so each one must keep the *"dataset-name"* naming.

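The dynamic import mentioned above can be illustrated with a minimal sketch. It is not the project's loader: the helper @load_crawler@, the module name @DEMO_crawler@ and the temporary folder are all hypothetical and only show why the module name has to be derivable from the dataset name.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_crawler(module_name, package_dir):
    """Import a crawler module by name and return its crawl function.

    Hypothetical helper: assumes the crawler is stored as
    <package_dir>/<module_name>.py and exposes crawl(config).
    """
    if package_dir not in sys.path:
        sys.path.insert(0, package_dir)
    # Make sure files created after interpreter start are visible.
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return module.crawl


# Demo: write a throwaway crawler module, then load it purely by its name.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "DEMO_crawler.py").write_text(
    "def crawl(config):\n"
    "    return 'crawling ' + config['dataset-name']\n"
)

crawl = load_crawler("DEMO_crawler", demo_dir)
result = crawl({"dataset-name": "DEMO"})
print(result)
```

Because the module is located only through a string, a crawler whose file name does not match the expected pattern would simply fail to import.
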
It is strongly recommended that a crawler download files through *basic_crawler_functions.download_file_from_url("file_url", "dataset_name")*.
That function records in the database which files have already been downloaded, which prevents duplicates and unnecessary re-downloads.

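The bookkeeping described above could look roughly like the following sketch. It is an assumption, not the code behind @download_file_from_url@: the class @DownloadRegistry@, the helper @download_once@, the table layout and the use of an in-memory SQLite database are all hypothetical stand-ins for whatever database the project actually uses.

```python
import sqlite3


class DownloadRegistry:
    """Hypothetical record of which URLs were already fetched per dataset."""

    def __init__(self, connection):
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS downloaded ("
            "dataset TEXT NOT NULL, url TEXT NOT NULL, "
            "PRIMARY KEY (dataset, url))"
        )

    def is_downloaded(self, dataset, url):
        row = self.connection.execute(
            "SELECT 1 FROM downloaded WHERE dataset = ? AND url = ?",
            (dataset, url),
        ).fetchone()
        return row is not None

    def mark_downloaded(self, dataset, url):
        self.connection.execute(
            "INSERT OR IGNORE INTO downloaded (dataset, url) VALUES (?, ?)",
            (dataset, url),
        )
        self.connection.commit()


def download_once(registry, dataset, url, fetch):
    """Call fetch(url) only if the URL is not recorded yet; return True if fetched."""
    if registry.is_downloaded(dataset, url):
        return False
    fetch(url)
    registry.mark_downloaded(dataset, url)
    return True


# Demo with a fake fetch function that only remembers what it was asked for.
fetched = []
registry = DownloadRegistry(sqlite3.connect(":memory:"))
first = download_once(registry, "DEMO", "https://example.com/a.zip", fetched.append)
second = download_once(registry, "DEMO", "https://example.com/a.zip", fetched.append)
```

The second call is skipped because the URL is already recorded, which is the behaviour a crawler loses if it downloads files by other means.
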
h2. Generated crawler

Running the script ** produces the following skeleton, which is then filled in with the actual functionality.

<pre>
# Path to crawled data
CRAWLED_DATA_PATH = "CrawledData/"


def crawl(config):
    """
    Implement crawl method that downloads new data to path_for_files
    For keeping the project structure
    url , regex, and dataset_name from config
    You can use already implemented functions from Utilities/Crawler/

    Args:
        config: loaded configuration file of dataset
    """
    dataset_name = config["dataset-name"]
    url = config['url']
    regex = config['regex']
    path_for_files = CRAWLED_DATA_PATH + dataset_name + '/'
    print("You must implements Crawl method first!")
</pre>
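The skeleton reads three keys from the configuration. The sketch below shows what such a configuration might look like once loaded and how the target folder is derived from it. The concrete values, the @REQUIRED_KEYS@ tuple and the helper @target_folder@ are hypothetical and not part of the generated code.

```python
CRAWLED_DATA_PATH = "CrawledData/"

# Keys the generated skeleton reads from the loaded configuration.
REQUIRED_KEYS = ("dataset-name", "url", "regex")


def target_folder(config):
    """Return the folder for downloaded files, failing early on missing keys."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("config is missing: " + ", ".join(missing))
    return CRAWLED_DATA_PATH + config["dataset-name"] + "/"


# Hypothetical configuration values, for illustration only.
example_config = {
    "dataset-name": "DEMO",
    "url": "https://example.com/data/",
    "regex": r"\.zip$",
}

path_for_files = target_folder(example_config)
print(path_for_files)
```
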
h2. Example crawler implementation

The crawler for the scooter dataset was chosen as an example. It relies mainly on functions from *Utilities.Crawler.basic_crawler_functions*.

<pre>
from Utilities import folder_processor
from Utilities.Crawler import basic_crawler_functions

# Path to crawled data
CRAWLED_DATA_PATH = "CrawledData/"


def crawl(config):
    """
    Implement crawl method that downloads new data to path_for_files
    For keeping the project structure
    url , regex, and dataset_name from config
    You can use already implemented functions from Utilities/Crawler/

    Args:
        config: loaded configuration file of dataset
    """
    dataset_name = config["dataset-name"]
    url = config['url']
    regex = config['regex']
    path_for_files = CRAWLED_DATA_PATH + dataset_name + '/'

    # First level: links on the start page whose name begins with OD_ZCU
    first_level_links = basic_crawler_functions.get_all_links(url)
    filtered_first_level_links = basic_crawler_functions.filter_links(first_level_links, "^OD_ZCU")
    absolute_first_level_links = basic_crawler_functions.create_absolute_links(filtered_first_level_links, url)

    files = []

    # Second level: on each linked page, keep links matching the configured regex
    # and drop those already recorded as downloaded
    for link in absolute_first_level_links:
        second_level_links = basic_crawler_functions.get_all_links(link)
        filtered_second_level_links = basic_crawler_functions.filter_links(second_level_links, regex)
        absolute_second_level_links = basic_crawler_functions.create_absolute_links(filtered_second_level_links, link)
        final_links = basic_crawler_functions.remove_downloaded_links(absolute_second_level_links, dataset_name)

        for file_link in final_links:
            files.append(file_link)

    # Download the remaining files and unzip them in the dataset folder
    for file in files:
        basic_crawler_functions.download_file_from_url(file, dataset_name)

    folder_processor.unzip_all_csv_zip_files_in_folder(path_for_files)
</pre>
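The example above chains four steps for each page: collect links, filter them, make them absolute and drop those already downloaded. The following stand-alone sketch mirrors that pipeline using only the standard library. The functions @extract_links@, @keep_matching@, @make_absolute@ and @drop_downloaded@ are hypothetical stand-ins and may behave differently from the real ones in *basic_crawler_functions*. In particular, it is an assumption that filtering is a regular-expression search on the raw link text.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def keep_matching(links, regex):
    pattern = re.compile(regex)
    return [link for link in links if pattern.search(link)]


def make_absolute(links, base_url):
    return [urljoin(base_url, link) for link in links]


def drop_downloaded(links, already_downloaded):
    return [link for link in links if link not in already_downloaded]


# Demo on a hard-coded page instead of a real HTTP request.
page = (
    '<a href="OD_ZCU_01_2020.zip">January</a>'
    '<a href="readme.txt">Readme</a>'
    '<a href="OD_ZCU_02_2020.zip">February</a>'
)
base_url = "https://example.com/data/"
already_downloaded = {"https://example.com/data/OD_ZCU_01_2020.zip"}

absolute = make_absolute(keep_matching(extract_links(page), "^OD_ZCU"), base_url)
final_links = drop_downloaded(absolute, already_downloaded)
print(final_links)
```

Filtering before resolving against the base URL matters here: a pattern anchored with @^@ such as @^OD_ZCU@ would no longer match once the link starts with the scheme and host.
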