Packet Harvester Data Analysis

Project for Digital Academy: Data

Our journey through the Digital Academy began with the topic Analysis of Residential Parking in Brno. Various open data can be found directly on the Brno city web pages, but the majority of the data we needed was not yet public and could only be obtained on request. Our original goal was to find out what the gradual changes in the parking system would bring, and to analyse the associated consequences of these changes as reflected in the number of authorised parking licences, imposed fines, the presence of non-residents, etc. Before obtaining any data, we decided on the platform on which we wanted to maintain the project, make changes and cooperate. We found storing the project on GitHub to be a very handy approach in which no change is irretrievably lost and it is even possible to work on the same files in parallel. Then we chose SourceTree as the program for managing a local copy of the GitHub project on our computers.

Hooray! After a long time, we had our first data to work with and we started to process it. What we got were geographic data, together with some properties of the individual zones, in JSON format. We processed them using Python: from the data provided, we calculated the area of each residential parking zone and paid parking area, and identified the time validity of each zone. For the data prepared this way we created our own database using Docker, connected to it directly from the Python data-processing script using the SQLAlchemy library, and imported the extracted data into it. The first hackathon thus brought us quite a diverse set of tools and a basis for processing the further data that was to come in the foreseeable future.
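As a minimal sketch of that pipeline, the snippet below parses a toy JSON payload, computes polygon areas with the shoelace formula, and loads the results into a database. All field names are invented, and SQLite stands in for the project's Docker Postgres + SQLAlchemy setup:

```python
import json
import sqlite3

# Toy payload mimicking the structure of the zone data we received;
# the real field names differed and are not reproduced here.
raw = json.loads("""
{"zones": [
  {"name": "A1", "validity": "Mon-Fri 17:00-06:00",
   "polygon": [[0, 0], [100, 0], [100, 50], [0, 50]]}
]}
""")

def polygon_area(points):
    """Shoelace formula: area of a simple polygon from its vertex list."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Import the computed attributes into a database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zones (name TEXT, validity TEXT, area REAL)")
for z in raw["zones"]:
    conn.execute("INSERT INTO zones VALUES (?, ?, ?)",
                 (z["name"], z["validity"], polygon_area(z["polygon"])))

rows = conn.execute("SELECT name, area FROM zones").fetchall()
print(rows)  # [('A1', 5000.0)]
```

With real geographic coordinates a proper projection (or a library such as Shapely) would be needed; the shoelace formula here only illustrates the planar case.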

Oh no! Unfortunately, we did not receive any new data. Moreover, working with the existing data would not support any further investigation, because there was simply not enough of it. So what now? We tried to get back in touch with the people who had been cooperating with us, and then waited a while until the content and scope of the data could be clearly communicated. As the days went by, we gradually realised that we should probably give up on the project we had done so far and look around for a new topic. Open data on well-known web pages? Let's do this. After some time spent on research, we agreed that we did not want to take this path. Any other plan? Of course! We would try to get data from our own company, chosen so that it would be neither export-controlled nor too sensitive for further processing in this project.

It was the eighth of November. Although "we found out that even yesterday was too late" (we were running out of time) and there were only three weeks left until the project deadline, we were still not giving up. The plan was to come to the second hackathon with a clear vision and collected data. And it worked.

One of Honeywell's divisions is Aerospace, which deals with system solutions for aircraft of various types. As is typical for the aviation industry, no change can be left without documentation and archiving. In the case of our scope, a transport aircraft from the company COMAC, all system requirements are written mainly in text form. These requirements are then further propagated to the level of software requirements, in a similar text structure, which are in turn implemented in software code and in selected physical units. The system requirements are verified by designated test procedures, which belong to another broad category — validation and verification documents. Other large groups are hardware, certification and flight mode control panel documents. After the process of creating or editing the relevant document, a so-called packet is made. PacketCreator is the Honeywell program in which the packet is created. The packet consists of basic information about the location of the document, its product type, a summary of the changes made, the changed document itself, a comparison of the new document with the previous version, related reports and attachments. The packet is then reviewed by several people, who add comments to it if they have any, followed by a moderation in which all the comments are discussed. The packet therefore also contains information about the people who participated in the review of the change, their comments (removed for the scope of this project) and information about the comments, for example the type of defect. These types can be divided into four large groups: technical, non-technical, process, and pointing out that the commented part is not wrong.

Data from the packets, carrying basic information about the type of changed document, participants, moderators, stamps given, people's effort, meeting details, sent and closure dates, and comment information, is harvested by a program into the company database. Not included were the documents themselves, diffs of documents, attachments, checklists, etc.

Whereas the engineers focus on the technical content of the comments added to the packets and on the implementation of the suggested changes, we wanted to look closer at the other details that are part of the packets, so the aim of our project was to analyse the remaining attributes from the Packet Harvester. These can tell us more about how long the processes take depending on the type of change, and how the completeness of filled-in fields evolved over time. Other interesting things to study were the defectiveness of the changes made since the project started, differences across suppliers, and how varied the answers can be when users are given free-text fields.

Data model I.

To get an overview of the data, the relations between the tables and their connections, we created a data model in Lucidchart.

Data extraction

The original data was stored in a protected company database, accessible only to employees and locked for editing. Using Python, we extracted the data into multiple CSV tables and uploaded them to our new project repository.
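A hedged sketch of that extraction step follows. An in-memory SQLite database stands in for the protected company database, and the table and column names are made up for illustration; the idea is simply "query each table, dump it to its own CSV":

```python
import csv
import sqlite3

# Stand-in for the read-only company database; contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packets (PacketId INTEGER, ReviewId TEXT, Status TEXT)")
conn.executemany("INSERT INTO packets VALUES (?, ?, ?)",
                 [(1, "REV-H-001", "Closed"), (2, "REV-F-002", "Open")])

# Dump the table into a CSV file for the project repository.
cur = conn.execute("SELECT * FROM packets")
with open("packets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([d[0] for d in cur.description])  # header row from cursor metadata
    writer.writerows(cur)                             # remaining rows stream from the cursor
```

One CSV per table keeps the later Power BI import scripts simple, since each script only has to read a single flat file.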

Data cleaning

Although the data had been extracted from a database, a cleanup was still required. One of the reasons was that some packet fields had been filled in by users free-hand. As a result, seemingly fixed sets of responses multiplied several times over, so instead of 5 answers there were 80, with no difference in meaning but a total difference in wording. Similarly, there were values that differed only slightly, where the difference came from changes in the dropdown lists of the program in which the packet was created. We could therefore treat these as find-and-replace values, such as "and" vs. "&", in order to make the data more consistent. Over time we also found out that even though each packet has its own unique ID, its Review Id, in other words its name, occurred in the data several times. The reason was that the company program harvested the data from a certain number of PDFs (which sometimes repeated) into the database several times at different dates or times. Consequently, some sent dates were also incorrect. Last but not least came the well-known blanks and nulls.

Since we decided to import the data into Power BI via Python scripts, almost all the cleanup for each table was done directly in the import script. On the one hand, the pandas library proved very useful for working with tables in CSV format; on the other hand, some of its methods and functions seemed too complicated compared to plain SQL, so we decided to also use the pandasql library.
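A minimal pandas-only sketch of the three cleanup problems described above (wording variants, blanks, and repeated harvests of the same Review Id). The toy data and column names are invented, and pandasql is omitted here for brevity:

```python
import pandas as pd

# Toy data modelled on the issues described above; real column names differed.
df = pd.DataFrame({
    "ReviewId":    ["R1", "R1", "R2", "R3"],
    "HarvestDate": ["2020-01-01", "2020-03-01", "2020-01-15", "2020-02-01"],
    "Answer":      ["yes & no", "yes and no", "  ", None],
})

# 1) Find-and-replace to unify wording ("&" vs "and").
df["Answer"] = df["Answer"].str.replace("&", "and", regex=False)

# 2) Treat whitespace-only cells as missing, like the blanks we met.
blank = df["Answer"].str.fullmatch(r"\s*", na=False)
df.loc[blank, "Answer"] = None

# 3) Keep only the latest harvest of each ReviewId (duplicates came from
#    the same PDFs being harvested repeatedly at different times).
df = (df.sort_values("HarvestDate")
        .drop_duplicates("ReviewId", keep="last")
        .reset_index(drop=True))
print(df)
```

In the real scripts each table got its own variant of these steps, so the CSV files landed in Power BI already consistent.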

Data processing

We already had the necessary data, so why not 'throw' it directly into PBI? Because there were still things to sort, assign, simplify and calculate. It is not only our opinion that PBI is intended primarily for visualisations rather than calculations; we also wanted to avoid clicking a thousand times in case a mistake was made (which, by the way, was quite likely, considering that the data and the tools were all quite new to us).

Document types and suppliers of the changes made — two attributes that needed to be simplified for graphical interpretation. For the document types, we went through the rawtypes table and assigned each old type one abbreviation from the new set we selected. This was followed by the creation of a new table with our set of new types — a code list assigning the abbreviation of each new type to its full name.

With Python, we created a new column based on the ReviewId column, from which we extracted the key letter specifying the supplier, and again created a new table — a code list assigning each letter to the full name of the supplier.

The third simplification was direct data rewriting in Python, as the packet Status column needed to be reduced from 9 items to just 4.
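The three simplifications above can be sketched with plain dictionary mappings, a regex extract, and a code-list join. Every value below is invented (the real status names, ReviewId format and code lists are company-internal), so treat this only as an illustration of the technique:

```python
import pandas as pd

# Hypothetical 9-to-4 status mapping; the real categories differed.
status_map = {
    "Draft": "Open", "In Review": "Open", "Sent": "Open",
    "Moderated": "In Work", "Rework": "In Work",
    "Closed": "Done", "Archived": "Done",
    "Cancelled": "Dropped", "Withdrawn": "Dropped",
}

df = pd.DataFrame({"ReviewId": ["REV-H-001", "REV-F-042"],
                   "Status":   ["Archived", "In Review"]})

# Rewrite the raw statuses into the simplified set.
df["StatusSimple"] = df["Status"].map(status_map)

# Extract the key letter identifying the supplier from ReviewId
# (the "REV-X-" pattern is made up for this sketch).
df["SupplierKey"] = df["ReviewId"].str.extract(r"REV-([A-Z])-", expand=False)

# Code list joining the key letter to a full supplier name.
suppliers = pd.DataFrame({"SupplierKey": ["H", "F"],
                          "Supplier":    ["Honeywell", "FACRI"]})
df = df.merge(suppliers, on="SupplierKey", how="left")
print(df[["ReviewId", "StatusSimple", "Supplier"]])
```

Keeping the mappings as small code-list tables (rather than hard-coded logic) means they can also be loaded into Power BI and joined there.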

To plot the completeness of the data over time, we decided to create a new table. It consisted of original columns such as PacketId, SentDate or ReviewType, as well as of selected data groups. For this selection of columns we created a Python script that expects a list of the correct values of the given column as input and, according to this list, divides the original column into four groups: correct, incorrect value, blank and NA.
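The classification step could look like the sketch below (the function name and sample values are ours, not the project's):

```python
import pandas as pd

def classify_column(series, correct_values):
    """Split a column into four groups: Correct, Incorrect, Blank, NA.

    `correct_values` is the list of values considered valid for the column.
    """
    def classify(value):
        if pd.isna(value):                 # missing entirely
            return "NA"
        if str(value).strip() == "":       # present but empty/whitespace
            return "Blank"
        if value in correct_values:        # matches the allowed set
            return "Correct"
        return "Incorrect"                 # filled in, but not a valid value
    return series.map(classify)

# Hypothetical example on a Meeting Location-like column.
col = pd.Series(["Brno", "see details below", "", None])
print(classify_column(col, ["Brno", "Prague"]).tolist())
# ['Correct', 'Incorrect', 'Blank', 'NA']
```

Grouping by year and counting these four labels then yields exactly the completeness-over-time shares shown in the report.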

Data model II.

Since the tables had changed and some new ones had been added, the data model was modified; this is its final version.

As we chose a very specific technical topic that an average reader wouldn't easily understand without further explanation, we decided to dedicate the first page to a brief introduction to the subject of our analysis. We graphically interpreted the whole change process for which a packet is generated, and also included the main information about the COMAC C919 aircraft. We also wanted to show the main characteristics of the dataset we worked with, for which we chose simple yet effective card visuals. Our dataset wasn't one of the largest, as it contained only around 10.6K rows, but each packet was assigned many attributes, which made the 'packets' table very wide. The packet counter can therefore be useful information for the tool maintenance team, because once the database grows too large, it might be preferable to divide that one wide table into several smaller ones to make calculations and database operations easier and quicker. Also interesting is the number of participants who contributed to the change processes in the observed time period (2012 to 2021). As development teams are usually several closed groups consisting of a limited number of engineers, 2134 people might seem a surprising figure. The last card shows the average number of comments per packet, which naturally depends on many factors such as the document type, its length, etc., but once again provides a general overview of how complex these review and change processes are.

This part focuses on the effort spent on the different types of documents. From the results it is clear that document changes regarding the Flight Mode Control Panel are by far the most time-consuming, both for the reviewers and for all the people included in the approval process. This can be attributed to the length of the documents as well as to their level of technical complexity. Interestingly, document changes regarding system requirements, software requirements and their implementation in real hardware are each less time-consuming than the validation and verification of those changes. The reason is that in the majority of cases, multiple test case scenarios are created for each requirement, then executed, and the results evaluated. The graph also shows that there were many document types not belonging to any big group; they were gathered into the group Other, which is why it holds third place in the graph.

As we already mentioned in the project abstract, we wanted to look at the packet database from a 'different perspective', or a different point of view. For the engineers, the only truly important columns in the database are the comments, or more specifically the comments' contents. These specify what is wrong in the changed document and what needs to be fixed, deleted or otherwise changed. However, if you looked into our data models, you wouldn't find any Comments column. There is a table called comments, but it doesn't contain the comments' texts. How so? Why not include it, when it's the most important one? Well, there are two reasons. First, the technical data of a development project are confidential and we don't have permission to share them with other people. Second, the information contained in those columns has probably been analysed many times, so we would only be duplicating work already done by someone else. Instead, we focused on the other attributes and tried to look for patterns that could lead us to interesting conclusions or possible suggestions for improvement. As we were working with the data, we stumbled upon significant inconsistencies in the completeness and correctness of some attributes. The first example is the 'Meeting Location' field. As you can see in the picture, during the first years of the development people didn't leave the field blank, but once the number of packets drastically increased (mainly in 2018 and 2019), people started to leave the field blank or inserted incorrect information, e.g. meeting details instead of the location. One possible cause is that the number of fields to fill in significantly increased over the years, and people simply started to skip the optional ones.
For easier analysis and increased validity of the data, it might be useful to create a list of predefined options (plus possibly an option to insert a custom one, if necessary), as it would help avoid the issue of 'one thing written 63 different ways'.

The next part of our analysis addresses the completeness of the packet cells. For this visual we chose two attributes — Meeting details and Load Number — each filled in by the user since the beginning of the project. The graph shows the percentage of three groups (Blank, Correct, Incorrect) out of all packets sent in a given year. On the one hand, we can see that incorrect and blank fields basically disappeared over time for high-priority information such as the load number; on the other hand, for lower-priority information such as meeting details, the problems with blank fields and incorrect values still persist.

During our analyses of the pattern of changes in completeness throughout the years, we found some attributes for which the blank values completely disappeared after a certain date, or, on the other hand, started appearing after a certain date. From these graphs, it can be deduced which attributes were present in the tool since the beginning and became mandatory after a certain date (Producer Site, Moderation Effort), which were added after a certain date (Moderator Site), and which were removed (Meeting Date). It wasn’t possible to identify a specific date of this change, because the packets are not created on a daily basis, but it’s obvious that a major update to the tool was made in 2015.

The right graph represents the quality of the changes made to whole packets. Once a packet is labeled with one of these categories, except for the status Accepted As Is, the lifecycle of the change cannot be completed; the change has to go back to some step of the process and the subsequent steps have to be repeated. The most time-demanding (and also the most expensive) category is work deferred. The left graph shows how many technical defects per packet, on average, each supplier of changes had over the whole project so far. Our thesis was not fully confirmed, because only FACRI had more defects per packet than Honeywell.

The last part of our analysis was dedicated to the review process. The review is the key part of the whole change evaluation: several technical experts are given the packet and their task is to check that the changes were made properly and the document doesn't contain any errors. If they identify an error (a defect), they insert a comment in which the defect is described. After the review is complete, a moderation meeting is held, where all the comments are checked and evaluated, and a defect type is assigned to each error found. Reviewers are usually invited to the moderation meeting, as they are technical specialists with vast knowledge of the particular area, so their presence is desirable; however, it is not mandatory for all of them to attend. We decided to look into this a bit deeper and find the percentage of reviewers who attend the moderation meetings. The result of 28% might seem insufficient, but for logical reasons it is not possible for all reviewers to attend all of the meetings, as they take place in different locations and time zones have to be taken into account; another factor might be that the meetings are very time-consuming. The very last graph shows the relation between the average number of comments and the overall review time. Our hypothesis was that the more time is spent on one review (by one person), the more comments are generated. Unfortunately, the validity of this hypothesis was not unambiguously proven, as the average review times and numbers of comments vary greatly without a prominent trend, so no definite conclusion can be made.

We think that even though the Digital Academy and our project (the new beginning) are coming to an end, this is just the beginning. Thanks to this intensive course we have broadened our horizons and gained an overview of possible data processing tools. We are already coming up with ideas to simplify our day-to-day work. After all, the whole project was built on the fact that, due to the time-consuming search for the necessary data in one system, another path opened up for me — a shortcut that I would probably have rejected without at least a basic knowledge of SQL. In the end, the database proved to be a suitable data set not only for a simple "SELECT" by date but also for more interesting analysis.

Petra Benešová

When I was applying for the Digital Academy: Data, I had no idea how challenging this experience was gonna be. As a fresh graduate, I felt like the 'learning' phase of my life was over and I was ready to tackle every problem — after five years of studies, how could I possibly get stuck and not know what to do? Oh well… you proved me wrong :D As the complete opposite of someone you'd call a creative person, I was more than happy to team up with Simča, who came up with a couple of ideas for our data project, and after a thorough evaluation (and the Meet Your Mentor event), we decided to stick with the analysis of Brno residential parking. And that's where the tough part came. As we soon realized, it wasn't possible to obtain all the data necessary for a complex analysis that would bring interesting and useful outcomes, and the visualization of geographical data also turned out to be very difficult. We created scripts for calculating the areas of the zones and learnt how to work with JSON structures, but it was far from enough. After two or three weeks of struggling, we had to admit that this was NOT gonna work. It was November 8th, 19 days till the deadline… and we had to start from scratch, including the choice of a new topic! I can't express how grateful I am that Simča didn't give up at that moment, but actively reached out to her colleagues at Honeywell and we obtained data from the Packet Harvester. From that moment, we knew we had to hurry up and get the work done as soon as possible. During the second hackathon, we wrote a script to get the data from the Honeywell database into CSV files, which we directly imported into Power BI, and drafted a couple of questions and hypotheses we'd like to analyse. Once this was done, it was all about the realization and visualization of said theses. Simča (as a proven Python expert) mostly took care of the programming part, while I was responsible for the creation of the data models, the visualizations in Power BI and tuning the design.
However, we were cooperating and helping each other most of the time, spending the cold fall evenings talking over Teams for hours. And we made it! When I look back, I see a lot of hard work and stressing over the deadline, but also great fun with Simča throughout the whole course, all the moments of pure concentration during the lectures, and gratitude for the valuable skills obtained during those three months. Big thanks to everyone who was part of this crazy intense study experience: to the lecturers for being so kind and helpful, all the other Czechitas for creating a great team spirit, and last (but not least) to our mentor Pavel for being our mental and technical support.

EDIT (11/27/2021): It’s a nice Saturday evening, one day before the project deadline….and GitHub DOESN’T WORK. This pretty much characterizes how it goes (meaning that we found another way to finish our work, nothing will stop us!)

Simona Sijková

At the beginning, when we dealt with the original topic, we set up Docker, SourceTree and GitHub together with Pavel's help. We received the data set just before the hackathon, and since Petra was not in full health in the second half of it, I did the attribute selection in Python and attached and stored the data in Postgres.

Then, at the second hackathon, we started working on our new project together: we used Python to export data from the company database to CSV files and continued with a more detailed acquaintance with the data and the creation of our theses. We ended the hackathon with the first import of a table into PBI. I continued to 'automate' the data retrieval — creating Python commands to import the necessary tables into PBI and changing the data types. Subsequently, we continued with iterations of discussions on how to properly clean, simplify and visualize the data. Then, based on the outcomes of those discussions, I adjusted the data-editing scripts while Petra set up the filtering and linking of the data in the data model during long calls on Teams. Over and over again. While I was working on the script to create the completeness table, Petra was working on the report design and visualizations, but during the process we also switched roles if we came up with something that 'didn't work' for us. While Petra was creating the data model, I was engaged in preparing this blog. We will give the presentation together.

We would like to thank all the people involved in the lessons — lecturers and assistants, and also our organizer Janka and Pavla for the very nice career discussions. Special thanks go to our mentor Pavel Mičan. We must point out that the whole academy was very well organized despite the pandemic situation. Even if the intense schedule reminded us of our uni times, the content of this academy was much more practical, the lecturers taught things from real work, and therefore it was a great experience.
