Visualization is a great way to get an overview of the data behind a credit model. Typically you start with data management and data cleaning, and the credit modeling analysis itself then begins with visualizations. This article is therefore the first part of a credit machine learning analysis and covers the visualizations. The second part of the analysis will typically use logistic regression and ROC curves.
Library of R packages
In the following sections we use R to visualize credit modeling data. First we load the packages:
# Data management packages
library(readr)
library(lubridate)
library(magrittr)
library(plyr)   # load plyr before dplyr so that dplyr's functions are not masked
library(dplyr)
library(skimr)  # provides skim(), used below for the summary statistics

# Visualization packages
library(ggplot2)
library(plotly)
library(ggthemes)
library(gridExtra)
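Before loading the packages it can help to check that they are all installed. The following is a minimal sketch of such a check; the vector name `pkgs` and the variable `missing_pkgs` are chosen for illustration only, and skimr is included on the assumption that `skim()` used later comes from that package.

```r
# Packages used in this article (skimr is assumed to be the source of skim()).
pkgs <- c("readr", "lubridate", "magrittr", "plyr", "dplyr", "skimr",
          "gridExtra", "ggplot2", "plotly", "ggthemes")

# requireNamespace() returns TRUE when a package is installed and loadable.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (possibly none).
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```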
Load dataset and data management
Next it is time to read the dataset and do some data management. We use the Lending Club loan dataset:
# Read the dataset into R
# stringsAsFactors = TRUE matches the pre-R 4.0 default, under which the
# text columns in the summary further down are reported as factors
loan <- read.csv("/loan.csv", stringsAsFactors = TRUE)

# Data management of the dataset: convert categorical variables to factors
loan$member_id <- as.factor(loan$member_id)
loan$grade <- as.factor(loan$grade)
loan$sub_grade <- as.factor(loan$sub_grade)
loan$home_ownership <- as.factor(loan$home_ownership)
loan$verification_status <- as.factor(loan$verification_status)
loan$loan_status <- as.factor(loan$loan_status)
loan$purpose <- as.factor(loan$purpose)
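The repeated `as.factor()` calls can also be written as a single step. The sketch below shows the idea on a small made-up data frame, so the values and the name `toy` are assumptions for illustration and not part of the Lending Club data.

```r
# Toy stand-in for the loan data; the values are invented for illustration.
toy <- data.frame(
  member_id      = c(101, 102, 103),
  grade          = c("A", "B", "A"),
  home_ownership = c("RENT", "OWN", "MORTGAGE"),
  loan_amnt      = c(5000, 12000, 8000),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical (hypothetical selection for the toy data).
factor_cols <- c("member_id", "grade", "home_ownership")

# Convert all listed columns at once instead of one line per column.
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Inspect the resulting column types.
str(toy)
```

Keeping the column names in one vector makes it easy to see at a glance which variables are treated as categorical.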
After the above data management it is time for data selection and data cleaning:
# Selection of variables for the analysis
loan <- loan[, c("grade", "sub_grade", "term", "loan_amnt", "issue_d", "loan_status",
                 "emp_length", "home_ownership", "annual_inc", "verification_status",
                 "purpose", "dti", "delinq_2yrs", "addr_state", "int_rate",
                 "inq_last_6mths", "mths_since_last_delinq", "mths_since_last_record",
                 "open_acc", "pub_rec", "revol_bal", "revol_util", "total_acc")]

# Data cleaning for missing observations
# Missing months since last delinquency / public record are set to 0
loan$mths_since_last_delinq[is.na(loan$mths_since_last_delinq)] <- 0
loan$mths_since_last_record[is.na(loan$mths_since_last_record)] <- 0

# Identify the variables that still contain missing values
var.has.na <- lapply(loan, function(x) any(is.na(x)))
num_na <- which(var.has.na == TRUE)

# Keep only the complete observations
loan <- loan[complete.cases(loan), ]

# Summary statistics of the cleaned dataset
skim(loan)

The skim() call gives the following summary statistics:

Skim summary statistics
 n obs: 886877
 n variables: 23

-- Variable type:factor --------------------------------------------------------
variable             missing  complete  n       n_unique  top_counts                                        ordered
addr_state           0        886877    886877  51        CA: 129456, NY: 74033, TX: 71100, FL: 60901       FALSE
emp_length           0        886877    886877  12        10+: 291417, 2 y: 78831, < 1: 70538, 3 y: 69991   FALSE
grade                0        886877    886877  7         B: 254445, C: 245721, A: 148162, D: 139414        FALSE
home_ownership       0        886877    886877  6         MOR: 443319, REN: 355921, OWN: 87408, OTH: 180    FALSE
issue_d              0        886877    886877  103       Oct: 48619, Jul: 45938, Dec: 44323, Oct: 38760    FALSE
loan_status          0        886877    886877  8         Cur: 601533, Ful: 209525, Cha: 45956, Lat: 11582  FALSE
purpose              0        886877    886877  14        deb: 524009, cre: 206136, hom: 51760, oth: 42798  FALSE
sub_grade            0        886877    886877  35        B3: 56301, B4: 55599, C1: 53365, C2: 52206        FALSE
term                 0        886877    886877  2         36: 620739, 60: 266138, NA: 0                     FALSE
verification_status  0        886877    886877  3         Sou: 329393, Ver: 290896, Not: 266588, NA: 0      FALSE

-- Variable type:numeric -------------------------------------------------------
variable                missing  complete  n       mean      sd        p0    p25    p50    p75    p100
annual_inc              0        886877    886877  75019.4   64687.38  0     45000  65000  90000  9500000
delinq_2yrs             0        886877    886877  0.31      0.86      0     0      0      0      39
dti                     0        886877    886877  18.16     17.19     0     11.91  17.66  23.95  9999
inq_last_6mths          0        886877    886877  0.69      1         0     0      0      1      33
int_rate                0        886877    886877  13.25     4.38      5.32  9.99   12.99  16.2   28.99
loan_amnt               0        886877    886877  14756.97  8434.43   500   8000   13000  20000  35000
mths_since_last_delinq  0        886877    886877  16.62     22.89     0     0      0      30     188
mths_since_last_record  0        886877    886877  10.83     27.65     0     0      0      0      129
open_acc                0        886877    886877  11.55     5.32      1     8      11     14     90
pub_rec                 0        886877    886877  0.2       0.58      0     0      0      0      86
revol_bal               0        886877    886877  16924.56  22414.33  0     6450   11879  20833  2904836
revol_util              0        886877    886877  55.07     23.83     0     37.7   56     73.6   892.3
total_acc               0        886877    886877  25.27     11.84     1     17     24     32     169
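To see what the cleaning step does, it can be useful to count the missing values per variable before and after. The following is a small sketch on invented data; the data frame `toy` and its values are assumptions for illustration, while the cleaning logic mirrors the code above.

```r
# Toy data with a few missing values; the numbers are invented.
toy <- data.frame(
  loan_amnt              = c(5000, 12000, NA, 8000),
  mths_since_last_delinq = c(NA, 14, 3, NA),
  revol_util             = c(55.1, NA, 20.0, 73.6)
)

# Number of missing values per column before cleaning.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Set missing months since last delinquency to 0, as in the article.
toy$mths_since_last_delinq[is.na(toy$mths_since_last_delinq)] <- 0

# Drop every row that still has a missing value in any column.
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

Note that setting the months to 0 treats "never delinquent" the same as "delinquent this month", which is worth keeping in mind when the variable is used in a model later.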
Visualizations for credit modeling
After loading and preparing the dataset, it is time to make the credit modeling visualizations in R:
# Bar chart of the number of customers per grade
ggplot(data = loan, aes(x = grade)) +
  geom_bar(color = "blue", fill = "green") +
  geom_text(stat = "count", aes(label = after_stat(count))) +
  theme_solarized()
ggplotly(p = ggplot2::last_plot())
The above code gives us the following graph:
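The counts printed on top of the bars can also be checked as a table. The sketch below uses a short invented `grade` vector in place of `loan$grade` and adds the share of each grade in percent.

```r
# Invented grades standing in for loan$grade.
grade <- factor(c("A", "B", "B", "C", "B", "A", "C", "D"))

# Number of customers per grade (the bar heights in the chart).
grade_counts <- table(grade)
print(grade_counts)

# Share of each grade in percent, rounded to one decimal.
grade_share <- round(100 * prop.table(grade_counts), 1)
print(grade_share)
```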
Let's look at how the customers are distributed by home ownership:
# Bar chart of the number of customers per home ownership type
ggplot(data = loan, aes(x = home_ownership)) +
  geom_bar(color = "blue", fill = "green") +
  geom_text(stat = "count", aes(label = after_stat(count))) +
  theme_solarized()
ggplotly(p = ggplot2::last_plot())
This gives us the following bar plot:
For the next visualizations we need some further data management:
# Data management for loan status: merge the "Does not meet the credit policy"
# levels into the regular "Charged Off" and "Fully Paid" levels
loan$loan_status <- revalue(loan$loan_status,
  c("Does not meet the credit policy. Status:Charged Off" = "Charged Off",
    "Does not meet the credit policy. Status:Fully Paid" = "Fully Paid"))

# Number of loans per loan status
loan_status_data <- loan %>%
  group_by(loan_status) %>%
  dplyr::summarize(total = n())

# Stacked bar chart of home ownership by loan status
ggplot(data = loan, aes(x = home_ownership, fill = loan_status)) +
  geom_bar()
ggplotly(p = ggplot2::last_plot())
The above code gives us the following visualization:
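Because the home ownership groups differ a lot in size, the stacked counts are hard to compare across groups. A cross table with row shares makes the comparison easier. This is a minimal sketch on invented data; `toy` and its values are assumptions for illustration.

```r
# Invented stand-in for the home_ownership and loan_status columns.
toy <- data.frame(
  home_ownership = c("RENT", "RENT", "RENT", "OWN", "OWN",
                     "MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE"),
  loan_status    = c("Current", "Charged Off", "Fully Paid", "Current", "Fully Paid",
                     "Current", "Current", "Fully Paid", "Charged Off")
)

# Counts per combination (the segment heights of the stacked bars).
status_tab <- table(toy$home_ownership, toy$loan_status)
print(status_tab)

# Share of each loan status within a home ownership group (rows sum to 1).
status_share <- round(prop.table(status_tab, margin = 1), 2)
print(status_share)
```

In ggplot2 the same shares can be drawn by using `geom_bar(position = "fill")` instead of the default stacking.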
Now let's look at loan status by verification status:
# Stacked bar chart of verification status by loan status
ggplot(data = loan, aes(x = verification_status, fill = loan_status)) +
  geom_bar()
ggplotly(p = ggplot2::last_plot())
This gives us the following plot:
Let's look at bar charts of the loan amount and the interest rate:
# Bar chart of loan amount
ggplot(data = loan, aes(x = loan_amnt)) +
  geom_bar(color = "green")
ggplotly(p = ggplot2::last_plot())

# Bar chart of interest rate
ggplot(data = loan, aes(x = int_rate)) +
  geom_bar(color = "green")
ggplotly(p = ggplot2::last_plot())
This gives the following two graphs:
Now let's look at histograms of loan amount and interest rate, colored by grade:
# Histogram of loan amount by grade
ggplot(data = loan, aes(x = loan_amnt, fill = grade)) +
  geom_histogram()
ggplotly(p = ggplot2::last_plot())

# Histogram of interest rate by grade
ggplot(data = loan, aes(x = int_rate, fill = grade)) +
  geom_histogram()
ggplotly(p = ggplot2::last_plot())
This gives us the following two histograms:
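When no bin setting is given, `geom_histogram()` falls back to 30 bins, so the picture depends on a default that may not suit the data. To illustrate how a chosen bin width turns values into counts, here is a small sketch with base R's `hist()` on invented loan amounts; the bin width of 5000 is an assumption for illustration.

```r
# Invented loan amounts standing in for loan$loan_amnt.
loan_amnt <- c(1000, 2500, 5000, 5000, 7500, 10000, 12000, 15000, 20000, 35000)

# Bin the values in steps of 5000 without drawing the plot.
bins <- hist(loan_amnt, breaks = seq(0, 35000, by = 5000), plot = FALSE)

# One row per bin: lower edge, upper edge and number of loans.
bin_table <- data.frame(
  lower = head(bins$breaks, -1),
  upper = bins$breaks[-1],
  count = bins$counts
)
print(bin_table)
```

In ggplot2 a comparable choice is made with the `binwidth` argument, for example `geom_histogram(binwidth = 5000)`, although the exact bin edges may be placed differently than in `hist()`.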
Now let's look at density plots of interest rate and loan amount:
# Density of interest rate
ggplot(data = loan, aes(x = int_rate)) +
  geom_density(fill = "green", color = "blue")
ggplotly(p = ggplot2::last_plot())

# Density of loan amount
ggplot(data = loan, aes(x = loan_amnt)) +
  geom_density(fill = "green", color = "blue")
ggplotly(p = ggplot2::last_plot())
This gives us the following density plots:
Next, it is time to look at the density plots of loan amount and interest rate by grade:
# Density of loan amount by grade
ggplot(data = loan, aes(x = loan_amnt, fill = grade)) +
  geom_density()
ggplotly(p = ggplot2::last_plot())

# Density of interest rate by grade
ggplot(data = loan, aes(x = int_rate, fill = grade)) +
  geom_density()
ggplotly(p = ggplot2::last_plot())
This gives us the following plots:
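Overlapping densities show where each grade is concentrated, but a single number per grade is often easier to read. As a rough complement, the sketch below computes the median interest rate per grade on invented data; `toy` and its values are assumptions for illustration.

```r
# Invented stand-in for the grade and int_rate columns.
toy <- data.frame(
  grade    = c("A", "A", "B", "B", "B", "C", "C"),
  int_rate = c(6.0, 7.5, 10.0, 11.5, 12.0, 14.5, 16.0)
)

# Median interest rate within each grade.
median_by_grade <- tapply(toy$int_rate, toy$grade, median)
print(median_by_grade)
```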
Lastly, let us look at box plots of interest rate by purpose and by grade:
# Box plot of interest rate by purpose
boxplot(int_rate ~ purpose, col = "darkgreen", data = loan)

# Box plot of interest rate by grade
boxplot(int_rate ~ grade, col = "darkgreen", data = loan)
The above code gives us the following two box plots:
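The numbers that a box plot draws can also be extracted directly, since `boxplot()` returns them when called with `plot = FALSE`. The following sketch uses invented data; `toy` and its values are assumptions for illustration.

```r
# Invented stand-in for the grade and int_rate columns.
toy <- data.frame(
  grade    = rep(c("A", "B"), each = 5),
  int_rate = c(5, 6, 7, 8, 9, 10, 12, 13, 15, 20)
)

# Compute the box plot statistics without drawing the plot.
bp <- boxplot(int_rate ~ grade, data = toy, plot = FALSE)

# One column per grade; rows are lower whisker, lower hinge, median,
# upper hinge and upper whisker.
colnames(bp$stats) <- bp$names
print(bp$stats)

# Observations drawn as individual outlier points.
print(bp$out)
```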