Authors:
(1) Anna-Katharina Wickert, Technische Universität Darmstadt, Darmstadt, Germany (wickert@cs.tu-darmstadt.de);
(2) Lars Baumgärtner, Technische Universität Darmstadt, Darmstadt, Germany (baumgaertner@cs.tu-darmstadt.de);
(3) Florian Breitfelder, Technische Universität Darmstadt, Darmstadt, Germany (florian.breitfelder@tu-darmstadt.de);
(4) Mira Mezini, Technische Universität Darmstadt, Darmstadt, Germany (mezini@cs.tu-darmstadt.de).
Table of Links
3 Design and Implementation of Licma and 3.1 Design
4 Methodology and 4.1 Searching and Downloading Python Apps
4.2 Comparison with Previous Studies
5 Evaluation and 5.1 GitHub Python Projects
6 Comparison with previous studies
9 Conclusion, Acknowledgments, and References
4 METHODOLOGY
To analyze Python applications, we constructed two distinct data sets of popular Python and MicroPython projects. Furthermore, we compared our findings in Python programs with previous studies about Java and C code.
4.1 Searching and Downloading Python Apps
The two data sets represent very different domains in which Python is used, ranging from server and desktop applications to low-level embedded code. The following subsections describe how we selected the projects in each data set for our empirical study.
4.1.1 Python Projects from GitHub. For our evaluation of crypto misuses in Python code, we focus on open-source code. Thus, we crawled and downloaded the top 895 Python repositories from GitHub, sorted by stars. To further understand the influence of dependencies, we downloaded the dependencies of each project with Python's standard dependency manager, pip. This resulted in 14,442 Python packages, of which 3,420 are unique.
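The crawl and the deduplication of dependencies can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the text does not say how the repositories were crawled, so the use of GitHub's repository search endpoint is an assumption, and the helper names `star_sorted_search_urls` and `unique_packages` are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the crawl goes through GitHub's public repository search endpoint.
GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def star_sorted_search_urls(total, per_page=100):
    """Return search URLs whose pages together cover the `total` most-starred Python repositories."""
    pages = -(-total // per_page)  # ceiling division
    urls = []
    for page in range(1, pages + 1):
        params = {
            "q": "language:python",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        }
        urls.append(f"{GITHUB_SEARCH_URL}?{urlencode(params)}")
    return urls


def unique_packages(per_project_dependencies):
    """Collapse per-project dependency lists into one set of package names (case-insensitive)."""
    return {
        name.lower()
        for dependencies in per_project_dependencies.values()
        for name in dependencies
    }
```

The last page requested may contain more repositories than needed, so a caller would truncate the combined result to the desired count (895 in the text above).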
As our analysis works on a per-file basis, we reduced our set to only those source code files that include the function calls referenced in our rules, e.g., AES.new(...). In addition, we filtered for production code and ignored test code, which should not be present during the execution of the application. After applying these two filter steps, we ended up with 946 source files from 155 different repositories. Unfortunately, Babelfish was unable to parse 35 of these files and reached the maximum recursion depth for the AST XPath queries for at least one rule in 50 files. These 85 parsing failures are distributed among 61 different projects. However, for each of these projects, at least one file with a crypto usage was analyzed successfully. In total, we successfully analyzed 861 files within 155 Python repositories with LICMA.
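The two filter steps described above might look roughly like the following sketch. The pattern list and the test-file heuristics are assumptions made for illustration: only AES.new(...) is named in the text, and the exact criteria used to recognize test code are not stated.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern list; the real one would be derived from the analysis rules.
RULE_CALL_PATTERNS = [re.compile(r"\bAES\.new\s*\(")]


def is_test_file(path):
    """Heuristic (assumed): a file is test code if it sits in a test directory or is named like a test."""
    parts = PurePosixPath(path).parts
    name = parts[-1] if parts else ""
    in_test_dir = any(part in {"test", "tests"} for part in parts[:-1])
    return in_test_dir or name.startswith("test_") or name.endswith("_test.py")


def uses_rule_call(source):
    """True if the source text contains at least one call referenced by a rule."""
    return any(pattern.search(source) for pattern in RULE_CALL_PATTERNS)


def select_files(files):
    """Given a mapping of path to source text, return the paths kept after both filter steps."""
    return sorted(
        path
        for path, source in files.items()
        if uses_rule_call(source) and not is_test_file(path)
    )
```

A textual match like this is only a cheap pre-filter: it decides which files are handed to the analysis, not whether a call is actually a misuse.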
4.1.2 Curated Top MicroPython Projects. As an extension to our Python application set, we crawled 51 MicroPython projects that are listed as the top MicroPython projects[5]. As with the regular Python applications, we downloaded all dependencies with pip and obtained 113 dependencies, including 1 duplicate. Afterwards, we applied the same filter steps as before: the usage of crypto and the exclusion of test files. These steps resulted in 5 files that seem to use the Python crypto libraries supported by LICMA. Note that we included the MicroPython crypto library ucryptolib in LICMA and our filtering steps. To further understand this small number of potential usages, we also analyzed our data set of MicroPython applications manually. This analysis revealed that we potentially missed five crypto usages.
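One way to check whether a file references a crypto library at all is to inspect its imports. The sketch below uses Python's standard `ast` module for this; it is an illustration rather than the filter used in the study. Apart from ucryptolib, the module names in `CRYPTO_MODULES` are assumptions, and the function name is hypothetical. It also assumes the MicroPython source is syntactically valid for the Python version running the check.

```python
import ast

# Illustrative module names; only ucryptolib is named in the text above.
CRYPTO_MODULES = {"ucryptolib", "Crypto", "cryptography"}


def imported_crypto_modules(source):
    """Return the top-level crypto modules that the given source text imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue  # not an absolute import statement
        for name in names:
            top_level = name.split(".")[0]
            if top_level in CRYPTO_MODULES:
                found.add(top_level)
    return found
```

An import-based check only reports that a library is referenced; it cannot see crypto functionality that does not go through one of the listed modules.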
This paper is available on arXiv under the CC BY 4.0 DEED license.
[5] https://awesomeopensource.com/projects/micropython