Channel: ggplot2 – DataScience+

Visualizations for credit modeling in R

Visualization is a good way to get an overview of the data before credit modeling. A typical analysis starts with data management and data cleaning, followed by exploratory visualizations. This article is therefore the first part of a credit machine learning analysis and covers the visualizations. The second part of such an analysis will typically use logistic regression and ROC curves.

Library of R packages

In the following sections we use R to visualize credit modeling data. First we load the required packages:

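# Data management packages
library(readr)
library(lubridate)
library(magrittr)
library(plyr)   # load plyr before dplyr so dplyr's verbs are not masked
library(dplyr)
library(gridExtra)
library(skimr)  # provides skim(), used for the summary statistics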
# Visualization packages
library(ggplot2) 
library(plotly)
library(ggthemes) 

Load dataset and data management

Next, we read the dataset and do some data management. We use the Lending Club loan dataset:

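# Read the dataset into R
# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which matches the factor variables in the skim summary
loan <- read.csv("/loan.csv", stringsAsFactors = TRUE)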
# Data management of the dataset
loan$member_id <- as.factor(loan$member_id)
loan$grade <- as.factor(loan$grade)
loan$sub_grade <- as.factor(loan$sub_grade)
loan$home_ownership <- as.factor(loan$home_ownership)
loan$verification_status <- as.factor(loan$verification_status)
loan$loan_status <- as.factor(loan$loan_status)
loan$purpose <- as.factor(loan$purpose)
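
The repeated as.factor() calls above can also be written as a single step. The following is a minimal sketch on a made-up data frame (the `toy` object, its values, and the `factor_cols` vector are illustrative assumptions, not part of the Lending Club data):

```r
# Toy stand-in for the loan data; the values are invented for illustration.
toy <- data.frame(
  grade = c("A", "B", "B", "C"),
  home_ownership = c("RENT", "OWN", "MORTGAGE", "RENT"),
  loan_amnt = c(5000, 12000, 8000, 20000),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical (a subset of the article's list).
factor_cols <- c("grade", "home_ownership")

# Convert all listed columns to factors in one step.
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Inspect the resulting column types.
str(toy)
```

Keeping the column names in one vector makes it easier to add or remove categorical variables later without editing several lines.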

Next, we select the variables for the analysis and clean the data:

# Selection of variables for the analysis
loan <- loan[,c("grade","sub_grade","term","loan_amnt","issue_d","loan_status","emp_length",
                          "home_ownership", "annual_inc","verification_status","purpose","dti",
                          "delinq_2yrs","addr_state","int_rate", "inq_last_6mths","mths_since_last_delinq",
                          "mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc")]
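# Data cleaning for missing observations
# Treat a missing "months since" value as 0
loan$mths_since_last_delinq[is.na(loan$mths_since_last_delinq)] <- 0
loan$mths_since_last_record[is.na(loan$mths_since_last_record)] <- 0
# Identify the variables that still contain missing values
var.has.na <- lapply(loan, function(x){any(is.na(x))})
num_na <- which( var.has.na == TRUE )
# Keep only the complete rows and summarize the result
loan <- loan[complete.cases(loan),]
skimr::skim(loan)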
Skim summary statistics
 n obs: 886877 
 n variables: 23 

-- Variable type:factor --------------------------------------------------------
            variable missing complete      n n_unique                                       top_counts ordered
          addr_state       0   886877 886877       51      CA: 129456, NY: 74033, TX: 71100, FL: 60901   FALSE
          emp_length       0   886877 886877       12  10+: 291417, 2 y: 78831, < 1: 70538, 3 y: 69991   FALSE
               grade       0   886877 886877        7       B: 254445, C: 245721, A: 148162, D: 139414   FALSE
      home_ownership       0   886877 886877        6   MOR: 443319, REN: 355921, OWN: 87408, OTH: 180   FALSE
             issue_d       0   886877 886877      103   Oct: 48619, Jul: 45938, Dec: 44323, Oct: 38760   FALSE
         loan_status       0   886877 886877        8 Cur: 601533, Ful: 209525, Cha: 45956, Lat: 11582   FALSE
             purpose       0   886877 886877       14 deb: 524009, cre: 206136, hom: 51760, oth: 42798   FALSE
           sub_grade       0   886877 886877       35       B3: 56301, B4: 55599, C1: 53365, C2: 52206   FALSE
                term       0   886877 886877        2                   36: 620739,  60: 266138, NA: 0   FALSE
 verification_status       0   886877 886877        3     Sou: 329393, Ver: 290896, Not: 266588, NA: 0   FALSE

-- Variable type:numeric -------------------------------------------------------
               variable missing complete      n     mean       sd     p0      p25      p50      p75       p100
             annual_inc       0   886877 886877 75019.4  64687.38   0    45000    65000    90000    9500000
            delinq_2yrs       0   886877 886877     0.31     0.86   0        0        0        0         39
                    dti       0   886877 886877    18.16    17.19   0       11.91    17.66    23.95    9999
         inq_last_6mths       0   886877 886877     0.69     1      0        0        0        1         33
               int_rate       0   886877 886877    13.25     4.38   5.32     9.99    12.99    16.2       28.99
              loan_amnt       0   886877 886877 14756.97  8434.43 500     8000    13000    20000      35000
 mths_since_last_delinq       0   886877 886877    16.62    22.89   0        0        0       30        188
 mths_since_last_record       0   886877 886877    10.83    27.65   0        0        0        0        129
               open_acc       0   886877 886877    11.55     5.32   1        8       11       14         90
                pub_rec       0   886877 886877     0.2      0.58   0        0        0        0         86
              revol_bal       0   886877 886877 16924.56 22414.33   0     6450    11879    20833    2904836
             revol_util       0   886877 886877    55.07    23.83   0       37.7     56       73.6      892.3
              total_acc       0   886877 886877    25.27    11.84   1       17       24       32        169
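
The missing-value handling used before the summary can be checked on a small example. This is a sketch on invented numbers (the `toy` data frame and the names `na_before` and `toy_clean` are assumptions for illustration); it counts missing values per column, recodes one column to 0, and then drops the remaining incomplete rows:

```r
# Toy data with missing values; the numbers are invented for illustration.
toy <- data.frame(
  mths_since_last_delinq = c(12, NA, 40, NA),
  annual_inc = c(50000, 65000, NA, 80000)
)

# Count missing values per column before cleaning.
na_before <- colSums(is.na(toy))

# Replace NA with 0 in the "months since" column.
toy$mths_since_last_delinq[is.na(toy$mths_since_last_delinq)] <- 0

# Drop any row that still has a missing value.
toy_clean <- toy[complete.cases(toy), ]

na_before
nrow(toy_clean)
```

Note that coding a missing "months since" value as 0 makes "no recorded event" indistinguishable from "event in the current month". Whether that is acceptable depends on how the variable is used in the later model.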

Visualizations for credit modeling

After loading and cleaning the dataset, we can create the credit modeling visualizations in R:

# Number of loans per credit grade
ggplot(data = loan, aes(x = grade)) +
  geom_bar(color = "blue", fill = "green") +
  geom_text(stat = "count", aes(label = after_stat(count))) +  # after_stat() replaces the deprecated ..count..
  theme_solarized()
ggplotly(p = ggplot2::last_plot())

The code above gives us the following graph:

Next, let’s look at how the loans are distributed across home ownership categories:

# Number of loans per home ownership category
ggplot(data = loan, aes(x = home_ownership)) +
  geom_bar(color = "blue", fill = "green") +
  geom_text(stat = "count", aes(label = after_stat(count))) +  # after_stat() replaces the deprecated ..count..
  theme_solarized()
ggplotly(p = ggplot2::last_plot())

This gives us the following bar plot:

For the next visualizations, we need some additional data management:

# Data management for loan status: fold the "Does not meet the credit policy" variants
# into the regular "Charged Off" and "Fully Paid" levels
loan$loan_status <- revalue(loan$loan_status, c(
  "Does not meet the credit policy. Status:Charged Off" = "Charged Off",
  "Does not meet the credit policy. Status:Fully Paid" = "Fully Paid"))
# Number of loans per status
loan_status_data <- loan %>% group_by(loan_status) %>% dplyr::summarize(total = n())
# Home ownership by loan status
ggplot(data = loan, aes(x = home_ownership, fill = loan_status)) + geom_bar()
ggplotly(p = ggplot2::last_plot())
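
To see what the level recoding does, here is a minimal base-R sketch on a made-up factor. It swaps plyr's revalue() for a direct edit of the factor levels, and the `status` and `status_counts` objects are assumptions for illustration only:

```r
# Toy loan statuses; invented for illustration.
status <- factor(c(
  "Fully Paid",
  "Charged Off",
  "Does not meet the credit policy. Status:Fully Paid",
  "Does not meet the credit policy. Status:Charged Off",
  "Fully Paid"
))

# Strip the policy prefix; levels that become identical are merged.
levels(status) <- sub("^Does not meet the credit policy\\. Status:", "", levels(status))

# Count loans per merged status (a base-R analogue of group_by + summarize).
status_counts <- as.data.frame(table(status), responseName = "total")
status_counts
```

After the recoding only two levels remain, so the stacked bar charts show one color per real loan outcome instead of separate colors for the policy variants.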

The code above gives us the following visualization:

Now let's look at loan status by verification status:

# Customer and loan verification
ggplot(data=loan, aes(x=verification_status, fill=loan_status))+ geom_bar()
ggplotly(p = ggplot2::last_plot())

This gives us the following plot:

Let's look at the distributions of loan amount and interest rate as bar charts:

# Loan amount
ggplot(data = loan,aes(x = loan_amnt)) + geom_bar(color = 'green')
ggplotly(p = ggplot2::last_plot())
# Interest rate
ggplot(data = loan,aes(x = int_rate))+ geom_bar(color = 'green')
ggplotly(p = ggplot2::last_plot())

This gives the following two graphs:

Now let's look at histograms of loan amount and interest rate, colored by grade:

#Histogram on loan amount
ggplot(data = loan,aes (x = loan_amnt,fill= grade))+ geom_histogram()
ggplotly(p = ggplot2::last_plot())
#Histogram  on interest rate
ggplot(data = loan,aes (x = int_rate,fill= grade))+ geom_histogram()
ggplotly(p = ggplot2::last_plot())
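
By default geom_histogram() uses 30 bins and prints a message suggesting a better choice. The following sketch shows how an explicit bin width could be set. It uses simulated data rather than the Lending Club file, and the bin width of 2500 is an arbitrary assumption chosen for illustration:

```r
library(ggplot2)

# Simulated loan amounts and grades; not the Lending Club data.
set.seed(1)
toy <- data.frame(
  loan_amnt = round(runif(200, min = 500, max = 35000)),
  grade = factor(sample(c("A", "B", "C"), 200, replace = TRUE))
)

# binwidth sets the width of each bar on the x axis;
# boundary = 0 aligns the bin edges to multiples of the bin width.
p <- ggplot(toy, aes(x = loan_amnt, fill = grade)) +
  geom_histogram(binwidth = 2500, boundary = 0)
p
```

A bin width that matches a natural unit of the variable, such as a round loan amount, usually makes the bars easier to read than the default.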

This gives us the following two histograms:

Now let’s look at density plots of interest rate and loan amount:

# Density on interest rate
ggplot(data = loan,aes(x = int_rate)) + geom_density(fill = 'green',color = 'blue')
ggplotly(p = ggplot2::last_plot())
# Density on loan amount
ggplot(data = loan,aes(x = loan_amnt)) + geom_density(fill = 'green',color = 'blue')
ggplotly(p = ggplot2::last_plot())

This gives us the following density plots:

Next, we look at the density plots of loan amount and interest rate by grade:

#density on loan based on grade type
ggplot(data = loan,aes(x = loan_amnt,fill = grade)) + geom_density()
ggplotly(p = ggplot2::last_plot())
#density on interest rate based on grade type
ggplot(data = loan,aes(x = int_rate,fill = grade)) + geom_density()
ggplotly(p = ggplot2::last_plot())

This gives us the following plots:

Lastly, let us look at box plots of interest rate by purpose and by grade:

# Box plot interest rate & purpose
boxplot(int_rate ~ purpose, col="darkgreen", data=loan)
# Boxplot interest rate & grade 
boxplot(int_rate ~ grade, col="darkgreen", data=loan)
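
The two box plots above use base R graphics. For consistency with the other charts, the same kind of plot could be drawn with ggplot2. This is a sketch on simulated interest rates (the `toy` data frame and its values are invented for illustration):

```r
library(ggplot2)

# Simulated interest rates by grade; the values are invented.
set.seed(42)
toy <- data.frame(
  grade = factor(rep(c("A", "B", "C"), each = 50)),
  int_rate = c(rnorm(50, 7, 1), rnorm(50, 11, 1.5), rnorm(50, 14, 2))
)

# One box per grade, with interest rate on the y axis.
p <- ggplot(toy, aes(x = grade, y = int_rate)) +
  geom_boxplot(fill = "darkgreen")
p
```

A ggplot object like `p` can also be passed to ggplotly() to get the same interactive behavior as the other charts in this article.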

The code above gives us the following two box plots:


