Exploratory Data Analysis Using Python Pandas

Vivek Singh
Nov 14, 2023

Repository Link : https://github.com/Viveksingh1313/LendingClubCaseStudy

Google Colab : https://colab.research.google.com/drive/12s8oBX9N87ZrFASqdiVXWWT8BHPfzlSP?usp=drive_link

The repository has 3 files :

  1. Vivek_Kumar_Singh.ipynb :- The Python notebook with the code
  2. Vivek_Kumar_Singh.pptx :- The presentation describing the project and the analysis of the data
  3. README.md :- Details on the problem the project is trying to solve

Technologies and libraries used : Python, pandas, matplotlib, seaborn, Google Colab

Project Question : We have been given a loan dataset with multiple columns. An online lending company is trying to reduce its credit loss. When an applicant applies for a loan, this project has to come up with conclusions on whether to approve the loan or not. Credit loss is the amount of money the company loses when an applicant does not repay the loan. If we approve a loan for an applicant who is likely to be a defaulter (not repay the loan), it’s a loss to the company. Similarly, if we don’t approve a loan for an applicant who is likely to repay it, that is also a loss.
There are three types of loan status in this CSV file : Fully Paid, Charged Off, Current.

Fully Paid : The applicant has repaid the full loan amount.
Charged Off : The applicant is a defaulter, who has not repaid the loan and will not repay it.
Current : The applicant is still repaying the loan and has not yet paid the full amount.
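
As a quick first look, the three statuses can be counted with pandas. This is only a minimal sketch on made-up toy data: the column name loan_status comes from the dataset description above, but the rows here are invented for illustration and are not taken from the real CSV.

```python
import pandas as pd

# Toy rows standing in for the real loan CSV (values invented for illustration).
loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid", "Fully Paid"]
})

# How many loans fall into each status.
status_counts = loans["loan_status"].value_counts()

# The same breakdown as a share of all loans.
status_share = loans["loan_status"].value_counts(normalize=True)

print(status_counts)
print(status_share.round(2))
```

On the real data, the share of Charged Off loans gives a rough baseline for how common defaulters are.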

Objective : We have to find the columns/variables which directly affect the loan_status column. To be more precise, we have to find the variables which are strongly related to an applicant being a defaulter.

Approach to solve this

  1. Import the dataset.
  2. Go through the dataset in Excel, and figure out the columns which can’t affect the loan_status column. Remove those columns.
  3. Find null values. Remove columns which have more than 90% null values.
  4. Remove duplicate rows.
  5. Change the data type of columns where needed.
  6. Clean up and derive columns. Examples: split a date into day, month and year columns, remove % from the interest_rate column, etc.
  7. Start analysing the data.
  8. Do a univariate analysis.
  9. Do a bivariate analysis.

That’s it.
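
The cleaning steps (3 to 6) can be sketched in pandas roughly as follows. This is a minimal sketch and not the exact code from the notebook: the toy DataFrame, the column names (loan_amnt, interest_rate, issue_date, mostly_empty) and the date format are all assumptions made for illustration. On the real data, step 1 would be a pd.read_csv call on the loan CSV file.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the loan CSV; all column names and values are
# hypothetical and chosen only to exercise each cleaning step.
raw = pd.DataFrame({
    "loan_amnt": [5000, 2500, 5000, 12000],
    "interest_rate": ["10.65%", "15.27%", "10.65%", "7.90%"],
    "issue_date": ["2011-12-15", "2011-11-03", "2011-12-15", "2010-01-20"],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Current"],
})

# Step 3: drop columns where more than 90% of the values are null.
null_share = raw.isnull().mean()
clean = raw.loc[:, null_share <= 0.9]

# Step 4: remove duplicate rows (rows 0 and 2 are identical here).
clean = clean.drop_duplicates().reset_index(drop=True)

# Steps 5 and 6: strip the % sign and convert the rate to a float.
clean["interest_rate"] = clean["interest_rate"].str.rstrip("%").astype(float)

# Steps 5 and 6: parse the date and split it into day, month and year columns.
clean["issue_date"] = pd.to_datetime(clean["issue_date"], format="%Y-%m-%d")
clean["issue_day"] = clean["issue_date"].dt.day
clean["issue_month"] = clean["issue_date"].dt.month
clean["issue_year"] = clean["issue_date"].dt.year

print(clean)
```

The 90% threshold is the one from step 3. Step 2 (dropping columns that cannot affect loan_status) is a manual judgment call, so it is not shown here.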

What’s a univariate analysis?

Univariate data consists of only one variable. Univariate analysis is thus the simplest form of analysis, since it deals with only one quantity that changes. It does not deal with causes or relationships. Its main purpose is to describe the data and find patterns that exist within it.
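
For example, a single numeric column can be described on its own, without looking at any other column. The sketch below uses an invented interest_rate column. The values and the bucket edges are assumptions for illustration only.

```python
import pandas as pd

# One variable only: hypothetical interest rates in percent.
loans = pd.DataFrame({"interest_rate": [7.9, 10.65, 10.65, 13.49, 15.27, 21.28]})

# Summary statistics: count, mean, std, min, quartiles, max.
summary = loans["interest_rate"].describe()

# Group the rates into buckets to see where most loans sit.
buckets = pd.cut(loans["interest_rate"], bins=[5, 10, 15, 20, 25])
bucket_counts = buckets.value_counts().sort_index()

print(summary)
print(bucket_counts)
```

A histogram or box plot drawn with matplotlib or seaborn shows the same distribution visually.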

What’s a bivariate analysis?

Bivariate data involves two different variables. Bivariate analysis deals with causes and relationships, and is done to find out the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.
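
Two small sketches of this idea follow, both on invented data. The first compares a hypothetical grade column against loan_status to get a default rate per group. The second measures the correlation between temperature and ice cream sales. All column names other than loan_status, and all values, are assumptions for illustration.

```python
import pandas as pd

# Categorical variable vs. loan_status (toy data, hypothetical "grade" column).
loans = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "loan_status": [
        "Fully Paid", "Fully Paid",
        "Fully Paid", "Charged Off", "Fully Paid",
        "Charged Off", "Charged Off", "Fully Paid",
    ],
})

# Flag defaulters, then take the mean of the flag per group:
# the mean of a True/False column is the share of True values.
loans["is_default"] = loans["loan_status"].eq("Charged Off")
default_rate = loans.groupby("grade")["is_default"].mean()
print(default_rate)

# Two numeric variables: temperature vs. ice cream sales (toy numbers).
weather = pd.DataFrame({
    "temperature_c": [20, 24, 28, 32, 36],
    "ice_cream_sales": [110, 150, 205, 260, 300],
})

# Pearson correlation: values near 1 mean the two rise together.
corr = weather["temperature_c"].corr(weather["ice_cream_sales"])
print(round(corr, 3))
```

A group whose default rate is clearly higher than the others is a candidate variable for the objective above. A strong correlation shows that two variables move together, but on its own it does not prove that one causes the other.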
