In this post, we will go through several outlier detection methods, from data preparation to univariate and bivariate techniques.

Dataset under consideration: Males (from the Ecdat package)

First, we explore the data with the summary() and str() functions.

library(Ecdat)
summary(Males)
##        nr             year          school          exper        union     
##  Min.   :   13   Min.   :1980   Min.   : 3.00   Min.   : 0.000   no :3296  
##  1st Qu.: 2329   1st Qu.:1982   1st Qu.:11.00   1st Qu.: 4.000   yes:1064  
##  Median : 4569   Median :1984   Median :12.00   Median : 6.000             
##  Mean   : 5262   Mean   :1984   Mean   :11.77   Mean   : 6.515             
##  3rd Qu.: 8406   3rd Qu.:1985   3rd Qu.:12.00   3rd Qu.: 9.000             
##  Max.   :12548   Max.   :1987   Max.   :16.00   Max.   :18.000             
##                                                                            
##     ethn      maried     health          wage       
##  other:3176   no :2446   no :4286   Min.   :-3.579  
##  black: 504   yes:1914   yes:  74   1st Qu.: 1.351  
##  hisp : 680                         Median : 1.671  
##                                     Mean   : 1.649  
##                                     3rd Qu.: 1.991  
##                                     Max.   : 4.052  
##                                                     
##                              industry   
##  Manufacturing                   :1231  
##  Trade                           :1169  
##  Professional_and_Related Service: 333  
##  Business_and_Repair_Service     : 331  
##  Construction                    : 327  
##  Transportation                  : 286  
##  (Other)                         : 683  
##                                occupation            residence   
##  Craftsmen, Foremen_and_kindred     :934   rural_area     :  85  
##  Operatives_and_kindred             :881   north_east     : 733  
##  Service_Workers                    :509   nothern_central: 964  
##  Clerical_and_kindred               :486   south          :1333  
##  Professional, Technical_and_kindred:453   NA's           :1245  
##  Laborers_and_farmers               :401                         
##  (Other)                            :696
str(Males)
## 'data.frame':    4360 obs. of  12 variables:
##  $ nr        : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year      : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ school    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ exper     : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ethn      : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ maried    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ health    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wage      : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ industry  : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
##  $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
##  $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...

Train-Test Split

Here, we split the data into a training set and a testing set: 70% of the observations go to training and 30% to testing.
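The post's own split code is not shown, so the following is only a minimal sketch of a random 70/30 split in base R. The helper name `split_train_test`, the seed value, and the toy data frame are assumptions made for illustration.

```r
# Hypothetical helper: randomly split a data frame into training and testing sets.
split_train_test <- function(df, train_frac = 0.7, seed = 42) {
  set.seed(seed)                                   # assumed seed, only for reproducibility
  n_train <- round(train_frac * nrow(df))          # number of training rows
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Toy data standing in for the Males data frame
toy <- data.frame(id = 1:100, wage = rnorm(100))
parts <- split_train_test(toy)
nrow(parts$train)  # 70
nrow(parts$test)   # 30
```

With Ecdat loaded, the same call pattern would be `split_train_test(Males)`.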

Yeo-Johnson Transformation

Most of the methods below work well only when the data are approximately normally distributed, so we first transform the variables toward a more normal shape. The Yeo-Johnson transformation helps here; unlike Box-Cox, it also accepts zero and negative values.
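To make the transformation concrete, here is a base R sketch of the Yeo-Johnson formula, with lambda chosen by maximum likelihood. The function names, the search range [-2, 2], and the toy data are assumptions; the post may well have used a package for this step instead.

```r
# Yeo-Johnson transform for a fixed lambda.
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(x[pos])                      # limit case lambda = 0
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-x[!pos])                  # limit case lambda = 2
  }
  out
}

# Profile log-likelihood of lambda under a normal model for the transformed data.
yj_loglik <- function(lambda, x) {
  y <- yeo_johnson(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) +
    (lambda - 1) * sum(sign(x) * log1p(abs(x)))
}

# Choose lambda by maximizing the log-likelihood over an assumed range.
estimate_lambda <- function(x) {
  optimize(yj_loglik, interval = c(-2, 2), x = x, maximum = TRUE)$maximum
}

set.seed(1)
skewed <- rexp(500) - 0.5          # toy right-skewed data with some negative values
lambda_hat <- estimate_lambda(skewed)
transformed <- yeo_johnson(skewed, lambda_hat)
```

With lambda = 1 the transform leaves the data unchanged, which is a quick sanity check on the implementation.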

Density Plot

To show the effect of the Yeo-Johnson transformation, we plot the density of one variable before and after the transformation.
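A before/after density comparison can be sketched as follows. The data are simulated, and the sketch assumes lambda = 0, where the Yeo-Johnson transform of non-negative values reduces to `log1p()`. Both choices are for illustration only and do not reproduce the post's figure.

```r
set.seed(1)
before <- rexp(500)          # toy right-skewed, non-negative data
after  <- log1p(before)      # Yeo-Johnson with lambda = 0 on non-negative input

d_before <- density(before)
d_after  <- density(after)

op <- par(mfrow = c(1, 2))   # two panels side by side
plot(d_before, main = "Before transformation")
plot(d_after,  main = "After transformation")
par(op)                      # restore the previous plotting layout
```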

## Univariate Outlier Detection

Box Plots

Now, let’s draw box plots for the numeric variables in the dataset. Any data point that falls outside the whiskers deserves a closer look.
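As a rough sketch, box plots for all numeric columns can be drawn in one call, and `boxplot.stats()` returns the points that fall beyond the whiskers. The toy data frame below is an assumption that stands in for the post's data; two extreme `school` values are planted on purpose.

```r
set.seed(1)
toy <- data.frame(
  school = c(rnorm(98, mean = 12, sd = 1.5), 3, 25),  # two planted extremes
  exper  = rnorm(100, mean = 6, sd = 2)
)

# Keep only numeric columns, then draw one box per variable
num_cols <- names(toy)[sapply(toy, is.numeric)]
boxplot(toy[num_cols], main = "Box plots of numeric variables")

# Points lying beyond the whiskers, per variable
beyond <- lapply(toy[num_cols], function(x) boxplot.stats(x)$out)
beyond$school
```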

Five Number Summary

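Box plots are built from the five-number summary: the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum.

Let’s compute the five-number summary for the variable wage. Points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1, are flagged as outliers. The values differ from the raw wage range in the summary above, presumably because they were computed after the transformation.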

## [1] -0.638459  2.035505  2.684175  3.404784  9.405495
## [1] "Outliers are outside region -0.0184130137921463 5.45870141858874"
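The limits printed above can be reproduced from the quartiles with the 1.5 * IQR rule. The first part of this sketch uses the printed five-number summary directly; the second part applies `fivenum()` to a small made-up vector, which is an assumption for illustration.

```r
# Five-number summary as printed above (min, Q1, median, Q3, max)
five <- c(-0.638459, 2.035505, 2.684175, 3.404784, 9.405495)
iqr   <- five[4] - five[2]
lower <- five[2] - 1.5 * iqr
upper <- five[4] + 1.5 * iqr
c(lower, upper)   # approximately -0.0184 and 5.4587

# On raw data, fivenum() returns the five quantities (quartiles as Tukey's hinges)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # toy vector
f <- fivenum(x)
fence <- 1.5 * (f[4] - f[2])
outliers <- x[x < f[2] - fence | x > f[4] + fence]
outliers          # 50
```

Note that `fivenum()` and `quantile()` can give slightly different quartiles on small samples because they use different definitions.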

Histogram

A histogram shows the frequency distribution of the data. Its main purpose here is to give a visual idea of the skewness, range, and distribution of values of each numeric variable. Here we plot the histogram for wage.
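A histogram of a single variable takes one call in base R. The simulated `wage_like` vector and the choice of 30 bins are assumptions; with the real data the vector would be the wage column.

```r
set.seed(1)
wage_like <- rnorm(1000, mean = 1.65, sd = 0.5)   # toy stand-in for wage

# hist() draws the plot and returns the bin breaks and counts
h <- hist(wage_like, breaks = 30,
          main = "Histogram of wage", xlab = "wage")
h$counts
```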

Z-Score

A z-score measures how many standard deviations an observation lies from the mean of the group of observations. Usually, an observation with a z-score outside (-3, +3) is treated as an outlier. The outliers found for wage using the z-score are as follows:

##         nr year school exper union  ethn maried health       wage
## 1327  2868 1986     13     9    no other    yes     no  3.3975260
## 1734  3525 1985     14     6    no other     no     no  3.4493634
## 2899  6987 1982     15     4    no other    yes     no  3.0858335
## 3051  7784 1982     11     3    no other     no     no  3.4727067
## 3053  7784 1984     11     5    no other     no     no  3.7774921
## 3054  7784 1985     11     6    no other     no     no  4.0518600
## 3056  7784 1987     11     8    no other     no     no  3.0963043
## 3134  8090 1985     14     7    no other    yes     no  3.0114436
## 3135  8090 1986     14     8    no other    yes     no  3.0992747
## 3136  8090 1987     14     9    no other    yes     no  3.0966616
## 3844  9859 1983     12     7   yes  hisp     no     no -0.7909872
## 4297 12410 1980     12     3   yes other     no     no -0.8153648
##                              industry                          occupation
## 1327                            Trade Managers, Officials_and_Proprietors
## 1734 Professional_and_Related Service                       Sales_Workers
## 2899                            Trade Managers, Officials_and_Proprietors
## 3051      Business_and_Repair_Service                       Sales_Workers
## 3053                          Finance                       Sales_Workers
## 3054                          Finance                       Sales_Workers
## 3056                          Finance                       Sales_Workers
## 3134                          Finance                       Sales_Workers
## 3135                          Finance                       Sales_Workers
## 3136                          Finance Managers, Officials_and_Proprietors
## 3844            Public_Administration                Laborers_and_farmers
## 4297                    Manufacturing              Operatives_and_kindred
##       residence
## 1327 north_east
## 1734      south
## 2899 north_east
## 3051      south
## 3053      south
## 3054      south
## 3056      south
## 3134       <NA>
## 3135       <NA>
## 3136       <NA>
## 3844       <NA>
## 4297       <NA>
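A filter of this kind can be sketched in a few lines. The helper name `zscore_outliers`, the threshold argument, and the toy data are assumptions; the code that produced the table above is not shown in the post.

```r
# Hypothetical helper: return rows whose z-score in `column` exceeds the threshold.
zscore_outliers <- function(df, column, threshold = 3) {
  x <- df[[column]]
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  df[!is.na(z) & abs(z) > threshold, , drop = FALSE]
}

set.seed(1)
toy <- data.frame(
  id   = 1:102,
  wage = c(rnorm(100, mean = 1.6, sd = 0.5), 6, -4)   # two planted extremes
)
out <- zscore_outliers(toy, "wage")
out$id   # the two planted rows, 101 and 102
```

Because the mean and standard deviation are themselves affected by extreme values, the z-score can miss outliers in small or heavily contaminated samples.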

## Bivariate Outlier Detection

From here on, we look at how to identify outliers when two or more variables are considered together.

Scatter Plot

From a scatter plot, one can visually identify outliers by considering two variables at once. Let’s plot the variables school and exper.
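A static version of such a plot can be sketched in base R. The toy data frame is an assumption whose value ranges mimic those of school and exper in the summary; `jitter()` is added because both variables are integers and many points would otherwise overlap.

```r
set.seed(1)
toy <- data.frame(
  school = sample(3:16, 200, replace = TRUE),   # toy values in the observed range
  exper  = sample(0:18, 200, replace = TRUE)
)

# Jitter integer-valued variables slightly so overlapping points stay visible
plot(jitter(toy$school), jitter(toy$exper),
     xlab = "school", ylab = "exper", main = "school vs exper")
```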


Bag Plot

A bag plot is a bivariate generalization of the box plot, capable of finding outliers in two- or three-dimensional data. Using aplpack::bagplot, it is possible to visualize and extract the outliers for the selected variables.
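The sketch below shows one way this might look, assuming the aplpack package is installed. The toy data and the planted point at (8, 8) are assumptions, and the outliers are read from the `pxy.outlier` component of the returned object, as described in the package documentation.

```r
set.seed(1)
x <- c(rnorm(100), 8)    # toy data with one planted bivariate outlier
y <- c(rnorm(100), 8)

# Run only if aplpack is available; factor = 3 is the package's default loop factor
if (requireNamespace("aplpack", quietly = TRUE)) {
  bag <- aplpack::bagplot(x, y, factor = 3, main = "Bag plot")
  bag$pxy.outlier        # coordinates of points flagged as outliers
}
```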