Outlier Detection in R

In this post, we will walk through several outlier detection methods and related topics:

Power Transformation (Yeo Johnson Transform) and Density Plots
Box Plots and 5 Point Summary Statistics
Z-Score
Histograms and Scatter Plots
Mosaic Plots and Bag Plots
Mahalanobis Distance
Cook’s Distance
DBSCAN
Local Outlier Factor
One-class Support Vector Machine

Dataset under consideration: Males (from the Ecdat package)

First, we will explore the data using the summary() and str() functions:

library(Ecdat)
summary(Males)

##        nr             year          school          exper        union     
##  Min.   :   13   Min.   :1980   Min.   : 3.00   Min.   : 0.000   no :3296  
##  1st Qu.: 2329   1st Qu.:1982   1st Qu.:11.00   1st Qu.: 4.000   yes:1064  
##  Median : 4569   Median :1984   Median :12.00   Median : 6.000             
##  Mean   : 5262   Mean   :1984   Mean   :11.77   Mean   : 6.515             
##  3rd Qu.: 8406   3rd Qu.:1985   3rd Qu.:12.00   3rd Qu.: 9.000             
##  Max.   :12548   Max.   :1987   Max.   :16.00   Max.   :18.000             
##                                                                            
##     ethn      maried     health          wage       
##  other:3176   no :2446   no :4286   Min.   :-3.579  
##  black: 504   yes:1914   yes:  74   1st Qu.: 1.351  
##  hisp : 680                         Median : 1.671  
##                                     Mean   : 1.649  
##                                     3rd Qu.: 1.991  
##                                     Max.   : 4.052  
##                                                     
##                              industry   
##  Manufacturing                   :1231  
##  Trade                           :1169  
##  Professional_and_Related Service: 333  
##  Business_and_Repair_Service     : 331  
##  Construction                    : 327  
##  Transportation                  : 286  
##  (Other)                         : 683  
##                                occupation            residence   
##  Craftsmen, Foremen_and_kindred     :934   rural_area     :  85  
##  Operatives_and_kindred             :881   north_east     : 733  
##  Service_Workers                    :509   nothern_central: 964  
##  Clerical_and_kindred               :486   south          :1333  
##  Professional, Technical_and_kindred:453   NA's           :1245  
##  Laborers_and_farmers               :401                         
##  (Other)                            :696

str(Males)

## 'data.frame':    4360 obs. of  12 variables:
##  $ nr        : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year      : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ school    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ exper     : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ethn      : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ maried    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ health    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wage      : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ industry  : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
##  $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
##  $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...

Train-Test Split

Here, we split the data into training and test sets: 70% of the rows go to training and 30% to testing.

set.seed(7)
# 70% of the sample size
smp_size <- floor(0.7 * nrow(Males))

# Train and Test Split
train_ind <- sample(seq_len(nrow(Males)), size = smp_size)
tr <- Males[train_ind, ]
te <- Males[-train_ind, ]

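# Group column names by type (positions follow the str() output above)
cols <- colnames(Males)
cols_num <- cols[3:4]                  # school, exper
cols_num_all <- c(cols_num, cols[9])   # school, exper, wage
cols_cat <- c(cols[5:8], cols[10:12])  # all factor columns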

Yeo-Johnson Transformation

Many of the methods below work best when the variables are approximately normally distributed. So, we first transform our numeric variables into a form that is closer to normal. The Yeo-Johnson transformation helps us here:

library(plotly)
library(recipes)

# Performing the Yeo-Johnson transform to bring variables closer to normality
# Setting variable wage as outcome and the remaining numerics as predictors
yj_estimates <-
  recipe(as.formula(paste("wage ~ ", paste(cols_num, collapse = " + "))), data = tr) %>%
  # Power transformation step (applied to all numeric columns, including wage)
  step_YeoJohnson(all_numeric()) %>%
  # Estimates the transformation parameters from the training data
  prep(training = tr)
# The trained recipe is then applied to the test set
yj_t <- bake(yj_estimates, te)
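
For intuition about what step_YeoJohnson() does to each column, here is a minimal base-R sketch of the Yeo-Johnson formula for a fixed lambda. The helper names yeo_johnson() and skewness() and the toy vector are assumptions made for illustration; recipes estimates a separate lambda for each variable from the training data instead of using a fixed value.

```r
# Yeo-Johnson transform for a fixed lambda
# (hypothetical helper for illustration, not part of the recipes API)
yeo_johnson <- function(y, lambda, eps = 1e-8) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox style transform of (y + 1)
  if (abs(lambda) < eps) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }
  # Negative values: mirrored transform with power (2 - lambda)
  if (abs(lambda - 2) < eps) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

# Moment-based sample skewness (hypothetical helper)
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

# Toy right-skewed data (illustrative only, not taken from Males)
x <- 2^(0:10) - 1
x_yj <- yeo_johnson(x, lambda = 0)

skew_before <- skewness(x)     # strongly positive
skew_after  <- skewness(x_yj)  # essentially zero
```

With lambda = 1 the transform returns the data unchanged, which makes a convenient sanity check for the helper.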

Density Plot

To show the effect of the Yeo-Johnson transform, we plot the density of one variable (exper) before and after the transformation.

de_b <- density(te$exper)
de_b <- data.frame(x = de_b$x, y = de_b$y)
before <- plot_ly(data = de_b, x = ~ x, y = ~ y)
layout(add_lines(before),
       title = "Density plot of non-transformed variable")

de_a <- density(yj_t$exper)
de_a <- data.frame(x = de_a$x, y = de_a$y)
after <- plot_ly(data = de_a, x = ~ x, y = ~ y)
layout(add_lines(after),
       title = "Density plot of transformed variable")

Univariate Outlier Detection

Box Plots

Now, let’s draw box plots for the numeric variables in the dataset. Any data point that falls outside the whiskers deserves a closer look.

bp <- plot_ly(data = yj_t, type = 'box')
for (k in seq_along(cols_num_all)) {
  dfk <- data.frame(y = yj_t[[cols_num_all[k]]])
  bp <-
    add_trace(
      bp,
      y = ~ y,
      data = dfk,
      name = cols_num_all[k],
      notched = TRUE,
      text =  ~ y
    )
}

layout(bp,
       yaxis = list(title = "Value"))

Five Number Summary

Box plots are built from the five-number summary, which consists of:

Minimum
Q1 - 25%
Median
Q3 - 75%
Maximum

Let’s compute the five-number summary for the variable wage:

fnum <- fivenum(yj_t$wage)
print(fnum)

## [1] -0.638459  2.035505  2.684175  3.404784  9.405495

# Calculating the fences: 1.5 * IQR below Q1 and above Q3
# (the whiskers extend to the most extreme points inside these limits)
low <- fnum[2] - 1.5 * (fnum[4] - fnum[2])
high <- fnum[4] + 1.5 * (fnum[4] - fnum[2])
print(paste("Outliers are outside region", low, high))

## [1] "Outliers are outside region -0.0184130137921463 5.45870141858874"
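
The same fence rule can be cross-checked against base R’s boxplot.stats(), whose $out component holds the points lying more than 1.5 times the hinge spread beyond the hinges. This is a small sketch on a made-up vector; x is illustrative and not taken from Males.

```r
# Toy data with one obvious extreme value (illustrative only)
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 30)

# Manual fences from the five-number summary
fnum <- fivenum(x)
iqr  <- fnum[4] - fnum[2]
low  <- fnum[2] - 1.5 * iqr
high <- fnum[4] + 1.5 * iqr
manual_out <- x[x < low | x > high]

# Built-in equivalent: $out holds the points beyond the fences
builtin_out <- boxplot.stats(x)$out
```

Both approaches flag only the value 30 here. Applied to this post’s data, boxplot.stats(yj_t$wage)$out should list the wage values outside the region printed above.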

Histogram

A histogram shows the frequency distribution of the data. Its main purpose here is to give a visual sense of the skewness, range, and spread of each numeric variable. The histogram for wage looks like this:

hist <- plot_ly(data = te,
                x = ~ wage,
                type = "histogram")

layout(hist,
       xaxis = list(title = "wage"),
       margin = list(t = 80))

Z-Score

A z-score measures how far an observation lies from the mean of a group of observations, expressed in units of the group’s standard deviation. Usually, an observation with a z-score outside (-3, +3) is considered an outlier. The outliers found for wage using the z-score are as follows:

# Standardize the transformed wage and keep the positions with |z| > 3
dfk <- scale(yj_t$wage)
rows_w <- which(abs(dfk) > 3)
te[rows_w, ]

##         nr year school exper union  ethn maried health       wage
## 1327  2868 1986     13     9    no other    yes     no  3.3975260
## 1734  3525 1985     14     6    no other     no     no  3.4493634
## 2899  6987 1982     15     4    no other    yes     no  3.0858335
## 3051  7784 1982     11     3    no other     no     no  3.4727067
## 3053  7784 1984     11     5    no other     no     no  3.7774921
## 3054  7784 1985     11     6    no other     no     no  4.0518600
## 3056  7784 1987     11     8    no other     no     no  3.0963043
## 3134  8090 1985     14     7    no other    yes     no  3.0114436
## 3135  8090 1986     14     8    no other    yes     no  3.0992747
## 3136  8090 1987     14     9    no other    yes     no  3.0966616
## 3844  9859 1983     12     7   yes  hisp     no     no -0.7909872
## 4297 12410 1980     12     3   yes other     no     no -0.8153648
##                              industry                          occupation
## 1327                            Trade Managers, Officials_and_Proprietors
## 1734 Professional_and_Related Service                       Sales_Workers
## 2899                            Trade Managers, Officials_and_Proprietors
## 3051      Business_and_Repair_Service                       Sales_Workers
## 3053                          Finance                       Sales_Workers
## 3054                          Finance                       Sales_Workers
## 3056                          Finance                       Sales_Workers
## 3134                          Finance                       Sales_Workers
## 3135                          Finance                       Sales_Workers
## 3136                          Finance Managers, Officials_and_Proprietors
## 3844            Public_Administration                Laborers_and_farmers
## 4297                    Manufacturing              Operatives_and_kindred
##       residence
## 1327 north_east
## 1734      south
## 2899 north_east
## 3051      south
## 3053      south
## 3054      south
## 3056      south
## 3134       <NA>
## 3135       <NA>
## 3136       <NA>
## 3844       <NA>
## 4297       <NA>
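
The z-score rule is easy to verify by hand. The sketch below uses a made-up vector (illustrative only, not taken from Males) to compute z-scores directly, confirm that they match scale(), and pick out the positions outside (-3, +3).

```r
# Toy data: 30 ordinary values and one extreme value (illustrative only)
x <- c(rep(c(9, 10, 11), 10), 25)

# z-score: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)

# scale() performs the same centering and scaling
z_scaled <- as.numeric(scale(x))

# Positions whose z-score falls outside (-3, +3)
outlier_idx <- which(abs(z) > 3)
x[outlier_idx]
```

Only the last element (25) is flagged. Note that an extreme value inflates the standard deviation it is measured against, so the rule becomes less sensitive when the sample is small or contains many outliers.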

Bivariate Outlier Detection

From here on, we look at how to identify outliers when two or more variables are considered together.

Scatter Plot

From a scatter plot, one can visually identify outliers in the joint distribution of two variables. Let’s plot school against exper:

# Hover label showing the original row name of each test observation
h_dat <- paste("Obs No :",
               rownames(te))
sc <-
  plot_ly(
    data = yj_t,
    x = ~school,
    y = ~exper,
    type = "scatter",
    mode = "markers",
    # Show only the label on hover
    hoverinfo = "text",
    text = h_dat
  )
# Axis titles match the plotted variables
layout(sc,
       xaxis = list(title = "school"),
       yaxis = list(title = "exper")
)

Bag Plot

A bag plot is a bivariate generalization of the box plot that can flag outliers in two-dimensional data. Using the bagplot() function from the aplpack package, we can visualize and fetch the outliers for a selected pair of variables.

library(aplpack)
bagplot(yj_t$school, yj_t$exper)