In this post, we will go through several outlier detection methods, both univariate and bivariate, along with the data preparation they need.

Dataset under consideration: Males (from the Ecdat package)

First, we will explore the data using the summary() and str() functions:

##        nr             year          school          exper        union     
##  Min.   :   13   Min.   :1980   Min.   : 3.00   Min.   : 0.000   no :3296  
##  1st Qu.: 2329   1st Qu.:1982   1st Qu.:11.00   1st Qu.: 4.000   yes:1064  
##  Median : 4569   Median :1984   Median :12.00   Median : 6.000             
##  Mean   : 5262   Mean   :1984   Mean   :11.77   Mean   : 6.515             
##  3rd Qu.: 8406   3rd Qu.:1985   3rd Qu.:12.00   3rd Qu.: 9.000             
##  Max.   :12548   Max.   :1987   Max.   :16.00   Max.   :18.000             
##     ethn      maried     health          wage       
##  other:3176   no :2446   no :4286   Min.   :-3.579  
##  black: 504   yes:1914   yes:  74   1st Qu.: 1.351  
##  hisp : 680                         Median : 1.671  
##                                     Mean   : 1.649  
##                                     3rd Qu.: 1.991  
##                                     Max.   : 4.052  
##                              industry   
##  Manufacturing                   :1231  
##  Trade                           :1169  
##  Professional_and_Related Service: 333  
##  Business_and_Repair_Service     : 331  
##  Construction                    : 327  
##  Transportation                  : 286  
##  (Other)                         : 683  
##                                occupation            residence   
##  Craftsmen, Foremen_and_kindred     :934   rural_area     :  85  
##  Operatives_and_kindred             :881   north_east     : 733  
##  Service_Workers                    :509   nothern_central: 964  
##  Clerical_and_kindred               :486   south          :1333  
##  Professional, Technical_and_kindred:453   NA's           :1245  
##  Laborers_and_farmers               :401                         
##  (Other)                            :696
## 'data.frame':    4360 obs. of  12 variables:
##  $ nr        : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year      : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ school    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ exper     : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ethn      : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ maried    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ health    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wage      : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ industry  : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
##  $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
##  $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...
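A minimal sketch of how the output above could be produced, assuming the Ecdat package is installed. The `requireNamespace()` guard is an addition so that the snippet fails gracefully when the package is missing.

```r
# Load the Males panel data from Ecdat, if the package is available.
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data("Males", package = "Ecdat")
  print(summary(Males))  # per-column summaries: quantiles or level counts
  str(Males)             # column types and the first few values
} else {
  message("Package 'Ecdat' not found; install it with install.packages('Ecdat').")
}
```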

Train-Test Split

Here, we will split the data into a training set and a testing set: 70% of the rows go to training and 30% to testing.
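One way to do such a split in base R is sketched below. The helper `split_train_test()` and the simulated `toy` data frame are hypothetical names used only for illustration; with the real data, `toy` would be replaced by `Males`.

```r
# Hypothetical helper: random row-wise split into training and testing sets.
split_train_test <- function(df, train_frac = 0.7, seed = 123) {
  set.seed(seed)                                  # make the split reproducible
  n_train <- floor(train_frac * nrow(df))
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Simulated stand-in for the Males data.
toy <- data.frame(id = 1:100, wage = rnorm(100, mean = 1.65, sd = 0.5))
parts <- split_train_test(toy)
nrow(parts$train)  # 70 rows
nrow(parts$test)   # 30 rows
```

Note that Males is panel data with several yearly rows per worker (`nr`), so a row-wise split like this one can place the same worker in both sets. Splitting on `nr` instead would avoid that, if it matters for the analysis.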

Yeo-Johnson Transformation

Most of the methods below work well only when the data are approximately normally distributed. So, we have to transform our variables into a form that is closer to a normal distribution. The Yeo-Johnson transformation will help us here:
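To make the transformation concrete, here is a hand-written sketch of the Yeo-Johnson formula in base R, with the parameter lambda chosen by maximising the normal profile log-likelihood. The function names `yeo_johnson()` and `estimate_lambda()` are hypothetical, the input is assumed to contain no missing values, and the data are simulated. Packages such as caret offer the same transform through `preProcess(method = "YeoJohnson")`; this version only illustrates what happens underneath.

```r
# Yeo-Johnson transform of a numeric vector y (no NAs) for a given lambda.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)                   # limit as lambda -> 0
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log(1 - y[!pos])                # limit as lambda -> 2
  }
  out
}

# Choose lambda by maximising the profile log-likelihood under normality.
estimate_lambda <- function(y, interval = c(-2, 2)) {
  loglik <- function(lambda) {
    z <- yeo_johnson(y, lambda)
    n <- length(y)
    -n / 2 * log(mean((z - mean(z))^2)) +
      (lambda - 1) * sum(sign(y) * log(abs(y) + 1))
  }
  optimize(loglik, interval = interval, maximum = TRUE)$maximum
}

set.seed(1)
x <- rexp(500)                      # simulated right-skewed variable
lambda_hat <- estimate_lambda(x)    # estimated transformation parameter
x_yj <- yeo_johnson(x, lambda_hat)  # transformed values
```

With lambda equal to 1 the transform returns the data unchanged, which is a handy sanity check.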

Density Plot

To show the effect of the Yeo-Johnson (YJ) transformation, the density of one variable is plotted before and after the transformation.
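A before-and-after comparison of this kind could be drawn as follows. The data are simulated, and `log1p()` stands in for the fitted transformation (it equals Yeo-Johnson with lambda = 0 on non-negative values); both are assumptions for illustration only.

```r
set.seed(1)
x   <- rexp(500)    # simulated right-skewed stand-in for the raw variable
x_t <- log1p(x)     # stand-in for the transformed values

op <- par(mfrow = c(1, 2))   # two panels side by side
plot(density(x),   main = "Before transformation", xlab = "x")
plot(density(x_t), main = "After transformation",  xlab = "transformed x")
par(op)                      # restore the previous plot layout
```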

Density plot of non-transformed variable

Density plot of transformed variable

Univariate Outlier Detection

Box Plots

Now, let’s draw box plots for the numeric variables in the dataset. Any data point that falls outside the whiskers is a candidate outlier, and that observation should be examined more closely.
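As a small illustration, `boxplot()` also returns the values it draws beyond the whiskers, so they can be pulled out for inspection. The `wage` vector here is simulated, with two extreme values planted on purpose.

```r
set.seed(1)
# Simulated stand-in for a numeric column, with two planted extremes.
wage <- c(rnorm(200, mean = 1.65, sd = 0.5), -3.5, 4.5)

bp <- boxplot(wage, main = "wage", horizontal = TRUE)
bp$out                    # values drawn beyond the whiskers
which(wage %in% bp$out)   # their positions, for closer inspection
```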


Five Number Summary

Box plots are built from the five-number summary: the minimum, the first quartile, the median, the third quartile, and the maximum.

Let’s compute the five-number summary for the variable wage:

## [1] -0.638459  2.035505  2.684175  3.404784  9.405495
## [1] "Outliers are outside region -0.0184130137921463 5.45870141858874"
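The region printed above is consistent with the usual 1.5 × IQR rule applied to the second and fourth numbers of the summary. The sketch below wraps that rule in a hypothetical helper, `iqr_fences()`, and then re-derives the printed limits from the printed summary.

```r
# Hypothetical helper: outlier fences from the five-number summary.
iqr_fences <- function(x, k = 1.5) {
  five <- fivenum(x)            # min, lower hinge, median, upper hinge, max
  spread <- five[4] - five[2]   # distance between the hinges
  c(lower = five[2] - k * spread, upper = five[4] + k * spread)
}

# Re-derive the limits from the five-number summary printed above.
five <- c(-0.638459, 2.035505, 2.684175, 3.404784, 9.405495)
spread <- five[4] - five[2]
printed_limits <- c(lower = five[2] - 1.5 * spread,
                    upper = five[4] + 1.5 * spread)
printed_limits   # about -0.0184 and 5.4587

# Small usage example on a toy vector with one obvious extreme.
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
fences <- iqr_fences(toy)
toy[toy < fences["lower"] | toy > fences["upper"]]   # values outside the fences
```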


Histogram

A histogram shows the frequency distribution of the data. Its main purpose here is to give a visual idea of the skewness, range, and distribution of values of each numeric variable. The histogram for wage looks like this:
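A histogram like this can be drawn with `hist()`. The `wage` vector below is simulated and the number of breaks is an arbitrary choice.

```r
set.seed(1)
# Simulated stand-in for wage, with one planted low extreme.
wage <- c(rnorm(500, mean = 1.65, sd = 0.5), -3.5)

h <- hist(wage, breaks = 30, main = "Histogram of wage", xlab = "wage")
h$counts   # bin counts; an isolated non-empty bin far from the bulk hints at outliers
```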


Z-Score

A z-score measures how many standard deviations an observation lies from the mean of a group of observations. Usually, an observation with a z-score outside (-3, +3) is considered an outlier. The outliers found for wage using the z-score are as follows:

##         nr year school exper union  ethn maried health       wage
## 1327  2868 1986     13     9    no other    yes     no  3.3975260
## 1734  3525 1985     14     6    no other     no     no  3.4493634
## 2899  6987 1982     15     4    no other    yes     no  3.0858335
## 3051  7784 1982     11     3    no other     no     no  3.4727067
## 3053  7784 1984     11     5    no other     no     no  3.7774921
## 3054  7784 1985     11     6    no other     no     no  4.0518600
## 3056  7784 1987     11     8    no other     no     no  3.0963043
## 3134  8090 1985     14     7    no other    yes     no  3.0114436
## 3135  8090 1986     14     8    no other    yes     no  3.0992747
## 3136  8090 1987     14     9    no other    yes     no  3.0966616
## 3844  9859 1983     12     7   yes  hisp     no     no -0.7909872
## 4297 12410 1980     12     3   yes other     no     no -0.8153648
##                              industry                          occupation
## 1327                            Trade Managers, Officials_and_Proprietors
## 1734 Professional_and_Related Service                       Sales_Workers
## 2899                            Trade Managers, Officials_and_Proprietors
## 3051      Business_and_Repair_Service                       Sales_Workers
## 3053                          Finance                       Sales_Workers
## 3054                          Finance                       Sales_Workers
## 3056                          Finance                       Sales_Workers
## 3134                          Finance                       Sales_Workers
## 3135                          Finance                       Sales_Workers
## 3136                          Finance Managers, Officials_and_Proprietors
## 3844            Public_Administration                Laborers_and_farmers
## 4297                    Manufacturing              Operatives_and_kindred
##       residence
## 1327 north_east
## 1734      south
## 2899 north_east
## 3051      south
## 3053      south
## 3054      south
## 3056      south
## 3134       <NA>
## 3135       <NA>
## 3136       <NA>
## 3844       <NA>
## 4297       <NA>
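A table like the one above can be obtained by computing z-scores for one column and keeping the rows beyond the threshold. The helper `z_outliers()` and the simulated `toy` data frame are hypothetical and used only for illustration.

```r
# Hypothetical helper: rows whose z-score in `column` exceeds the threshold.
z_outliers <- function(df, column, threshold = 3) {
  x <- df[[column]]
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  df[which(abs(z) > threshold), , drop = FALSE]   # which() drops NA z-scores
}

set.seed(1)
# Simulated stand-in for the data, with two planted extremes in the last rows.
toy <- data.frame(id = 1:202,
                  wage = c(rnorm(200, mean = 1.65, sd = 0.5), -3.5, 6))
out <- z_outliers(toy, "wage")
out
```

Because the mean and standard deviation are themselves pulled by extreme values, the z-score rule can miss outliers in small or heavily contaminated samples.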

Bivariate Outlier Detection

From here on, we will see how to identify outliers when two or more variables are considered together.

Scatter Plot

From a scatter plot, one can identify outliers by eye by looking at two variables at once. Let’s plot the variables school and exper:
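A scatter plot of two such columns might be drawn as below. Both vectors are simulated stand-ins for `school` and `exper`. Since both variables are integers, `jitter()` is added here (my choice, not necessarily the original one) so that overlapping points remain visible.

```r
set.seed(1)
school <- sample(8:16, 300, replace = TRUE)   # simulated stand-in for school
exper  <- sample(0:12, 300, replace = TRUE)   # simulated stand-in for exper
school[300] <- 3                              # one planted unusual combination
exper[300]  <- 18

plot(jitter(school), jitter(exper),
     xlab = "school", ylab = "exper", main = "school vs exper",
     pch = 16, col = rgb(0, 0, 0, 0.4))       # semi-transparent points
```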

Bag Plot

A bag plot is a generalization of the box plot that can find outliers in two- or three-dimensional data. Using aplpack::bagplot(), we can both visualize and extract the outliers for the selected variables.
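A minimal sketch of that workflow, assuming the aplpack package is installed. The data are simulated with one planted extreme point, and the guard around the call is an addition so the snippet still runs without the package.

```r
set.seed(1)
# Simulated correlated pair with one planted point far from the cloud.
x <- c(rnorm(200), 8)
y <- c(0.5 * x[1:200] + rnorm(200, sd = 0.5), -8)

bag_outliers <- NULL
if (requireNamespace("aplpack", quietly = TRUE)) {
  bp <- aplpack::bagplot(x, y)       # draws the bag, the loop and the outliers
  bag_outliers <- bp$pxy.outlier     # coordinates of points flagged as outliers
  print(bag_outliers)
} else {
  message("Package 'aplpack' not found; install it with install.packages('aplpack').")
}
```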