Case 1
Data Manipulation, Cleansing And Joining
- National Grid performs routine, scheduled inspections of our assets in the field to make sure that they are operating safely and effectively. Using the data sets provided via spreadsheet, 1.1 How many Distribution Pole inspections were performed in 2020?
1.2 How many of these inspections earned a failing score (rating >= 3)?
1.3 How did you determine this answer?
Assumption:
- The date fields are not very clear. The Inspection_Year column is empty and the dates are in special format that I have not gotten a chance to decipher yet.
Answer:
1.1 Assumed that all this data is for year 2020 (mentioned above)
Total Distribution Pole inspections: 67080
1.2 Total Distribution Pole inspections with Rating >=3: 3616
1.3 Cleaned up and filtered the DB.EDA.DS_POLE_INSPECTIONS.csv file to get the result.
Case 2
Data Manipulation, Cleansing And Joining
- Overhead transformer installations are very heavy – often weighting hundreds of pounds - and need to be installed on poles that are sturdy enough to support this load. Using the data sets provided in the spreadsheets,
2.1 How many OH Distribution Transformers are installed on class 5 wooden distribution poles under 40 feet in height?
2.2 How did you determine this answer?
Answer:
2.1 Total Overhead Transformers: 3707
Total OH Distribution Transformers that are installed on class 5 wooden distribution poles under 40 feet in height: 189
2.2 Exploration of DB.EDA.VW_DISTRIBUTION_POLE.csv file by filtering, cleansing and joining it with DEVICE file.
Case 3
Data Evaluation And Scoring
- National Grid needs reliable data to drive sound business decisions. Please review the files DB.EDA.VW_DISTRIBUTION_POLE & DB.EDA.DS_POLE_INSPECTIONS, and provide your insights on the following questions:
3.1. Which fields would you speculate are most important to making business decisions using our distribution pole data?
3.2. How would you evaluate the quality of the data?
3.3. Do you see any potential concerns about these data sets?
3.4. What are some ways in which these concerns might these be addressed?
Answer:
I did a quick EDA for these data frames. Here are some thoughts:
Some top level info about the data 1. DS_POLE_INSPECTIONS - 3 columns have nearly 100% missing-ness - POLE_NEW, INSPECTION_YEAR, FUTURE_INSPECTION_YEAR. - Some variables are more skewed than others like - CATV_ATTACH, POLE_RATING, TELEPHONE_ATTACHED and they also have 30% data missing. - NGLOADDATE, NGMODDATE, NGSRCSYSID, NGSRCSYSCREATEDATE, NGSRCSYSMODDATE - have dates in some special format. Also, just one value for the entire file. - It would be great to have a data dictionary or catalogue for someone to understand easily what the data is about and what different columns are doing.
- 80% of the columns data is complete (100% available).
- VW_DISTRIBUTION_POLE
- 3 columns have nearly 100% missing-ness - POLE_NEW, INSPECTION_YEAR, FUTURE_INSPECTION_YEAR.
- There’s no column that’s complete (100% non-missing-ness).
- Data has more
Findings on data quality
Issues
- INSPECTION_YEAR & FUTURE_INSPECTION_YEAR in inspections db is not getting recorded.
- There’s presence of some many-to-many relationships which might be worth a review.
- Bit unclear on what Inspection_Date value of 00:00.0 indicates.