In this post, we will go through several outlier detection methods and related topics:
- Power Transformation (Yeo-Johnson Transform) and Density Plots
- Box Plots and 5 Point Summary Statistics
- Z-Score
- Histograms and Scatter Plots
- Mosaic Plots and Bag Plots
- Mahalanobis Distance
- Cook’s Distance
- DBSCAN
- Local Outlier Factor
- One-class Support Vector Machine
The dataset under consideration is Males, from the Ecdat package.
First, we explore the data using the summary() and str() functions:
## nr year school exper union
## Min. : 13 Min. :1980 Min. : 3.00 Min. : 0.000 no :3296
## 1st Qu.: 2329 1st Qu.:1982 1st Qu.:11.00 1st Qu.: 4.000 yes:1064
## Median : 4569 Median :1984 Median :12.00 Median : 6.000
## Mean : 5262 Mean :1984 Mean :11.77 Mean : 6.515
## 3rd Qu.: 8406 3rd Qu.:1985 3rd Qu.:12.00 3rd Qu.: 9.000
## Max. :12548 Max. :1987 Max. :16.00 Max. :18.000
##
## ethn maried health wage
## other:3176 no :2446 no :4286 Min. :-3.579
## black: 504 yes:1914 yes: 74 1st Qu.: 1.351
## hisp : 680 Median : 1.671
## Mean : 1.649
## 3rd Qu.: 1.991
## Max. : 4.052
##
## industry
## Manufacturing :1231
## Trade :1169
## Professional_and_Related Service: 333
## Business_and_Repair_Service : 331
## Construction : 327
## Transportation : 286
## (Other) : 683
## occupation residence
## Craftsmen, Foremen_and_kindred :934 rural_area : 85
## Operatives_and_kindred :881 north_east : 733
## Service_Workers :509 nothern_central: 964
## Clerical_and_kindred :486 south :1333
## Professional, Technical_and_kindred:453 NA's :1245
## Laborers_and_farmers :401
## (Other) :696
## 'data.frame': 4360 obs. of 12 variables:
## $ nr : int 13 13 13 13 13 13 13 13 17 17 ...
## $ year : int 1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
## $ school : int 14 14 14 14 14 14 14 14 13 13 ...
## $ exper : int 1 2 3 4 5 6 7 8 4 5 ...
## $ union : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ ethn : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ health : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ wage : num 1.2 1.85 1.34 1.43 1.57 ...
## $ industry : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
## $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
## $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...
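The output above is what the usual inspection calls print. A minimal sketch of how it could be reproduced, assuming the Ecdat package is installed:

```r
# Load the Males panel data shipped with the Ecdat package
library(Ecdat)
data(Males)

# Per-column summaries: quartiles for numerics, level counts for factors
summary(Males)

# Compact structure: dimensions, column types and the first few values
str(Males)
```

Note that summary() already hints at things worth checking later, such as the negative minimum of wage and the 1245 missing values in residence.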
Train-Test Split
Here, we split the data into training and testing sets: 70% of the rows go to training and the remaining 30% to testing.
set.seed(7)
# 70% of the sample size
smp_size <- floor(0.7 * nrow(Males))
# Train and Test Split
train_ind <- sample(seq_len(nrow(Males)), size = smp_size)
tr <- Males[train_ind, ]
te <- Males[-train_ind, ]
# Column names split
cols <- colnames(Males)
cols_num <- cols[3:4]
cols_num_all <- c(cols_num, cols[9])
cols_cat <- c(cols[5:8], cols[10:12])
Yeo-Johnson Transformation
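Many of the methods below work best when the variables are roughly normally distributed. So, we transform the numeric variables into a form that is closer to normal. The Yeo-Johnson transformation helps here; unlike Box-Cox, it also accepts zero and negative values, which matters because wage takes negative values in this dataset: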
library(plotly)
library(recipes)
# Performing Yeo-Johnson transform for normalization
# Setting variable wage as outcome and the remaining numerics as predictors
yj_estimates <-
  recipe(as.formula(paste("wage ~ ", paste(cols_num, collapse = " + "))), data = tr) %>%
  # Power transformation step (applied to every numeric column, wage included)
  step_YeoJohnson(all_numeric()) %>%
  # Estimates the transformation parameters from the training data
  prep(training = tr)
# The trained recipe is then applied to the test set
yj_t <- bake(yj_estimates, te)
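For intuition about what step_YeoJohnson() estimates, the transformation itself can be written in a few lines of base R. This is only a sketch and not the recipes implementation: the function name yeo_johnson and the fixed lambda are assumptions for illustration, whereas recipes estimates a lambda for each variable from the training data.

```r
# Hypothetical helper: Yeo-Johnson transform for a given lambda
# (assumes x has no missing values)
yeo_johnson <- function(x, lambda, eps = 1e-8) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Non-negative values: Box-Cox applied to x + 1
  out[pos] <- if (abs(lambda) > eps) {
    ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    log1p(x[pos])
  }
  # Negative values: mirrored Box-Cox with exponent 2 - lambda
  out[!pos] <- if (abs(lambda - 2) > eps) {
    -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    -log1p(-x[!pos])
  }
  out
}

# Illustrative values only; lambda = 0.5 is an arbitrary choice
x <- c(-1.5, 0, 0.5, 2, 10)
yeo_johnson(x, lambda = 0.5)
```

With lambda = 1 the function returns its input unchanged, which is a handy sanity check.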
Density Plot
To show the effect of the Yeo-Johnson transform, the density of one variable is plotted before and after the transformation.
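One way to draw such a comparison is sketched below with base R graphics. The helper plot_density_pair is a hypothetical name, and the exponential sample is stand-in data so the snippet runs on its own.

```r
# Hypothetical helper: overlay kernel density estimates of a variable
# before and after a transformation
plot_density_pair <- function(before, after, name = "variable") {
  d_before <- density(before, na.rm = TRUE)
  d_after  <- density(after, na.rm = TRUE)
  plot(d_before,
       xlim = range(d_before$x, d_after$x),
       ylim = range(0, d_before$y, d_after$y),
       main = paste("Density of", name, "before and after transformation"),
       xlab = name)
  lines(d_after, lty = 2)
  legend("topright", legend = c("before", "after"), lty = c(1, 2))
  # Return both estimates invisibly for further inspection
  invisible(list(before = d_before, after = d_after))
}

# Stand-in data: a right-skewed sample and a log-type transform of it
set.seed(7)
raw <- rexp(500)
dens <- plot_density_pair(raw, log1p(raw), name = "x")
```

With the objects from this post, the equivalent call would be plot_density_pair(te$wage, yj_t$wage, "wage").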
Univariate Outlier Detection
Box Plots
Now, let’s plot box plots for numeric variables in the dataset. If you see a datapoint outside the whiskers, then you need to examine that particular observation.
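As a small sketch of the idea, boxplot() both draws the plot and returns the numbers behind it, so the points outside the whiskers can be read off directly. The toy vector below is illustrative data, not taken from Males.

```r
# Toy sample with one planted extreme value (illustrative data only)
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 20)

# boxplot() draws the plot and invisibly returns the statistics behind it
bp <- boxplot(x, horizontal = TRUE, main = "Box plot of x")

bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # values drawn as individual points beyond the whiskers
```

For the data in this post, the same call could be applied to a column such as yj_t$wage.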
Five Number Summary
Box plots are built from the five-number summary, which consists of:
- Minimum
- Q1 - 25%
- Median
- Q3 - 75%
- Maximum
Let’s compute the five-number summary for the variable wage (after the Yeo-Johnson transform):
## [1] -0.638459 2.035505 2.684175 3.404784 9.405495
# Calculating end point of whiskers
low = fnum[2] - 1.5 * (fnum[4] - fnum[2])
high = fnum[4] + 1.5 * (fnum[4] - fnum[2])
print(paste("Outliers are outside region", low, high))
## [1] "Outliers are outside region -0.0184130137921463 5.45870141858874"
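The whisker calculation above can be wrapped into a small reusable function. This is a sketch that assumes the summary vector comes from fivenum(), as fnum presumably did; the name iqr_fences and the toy data are hypothetical.

```r
# Hypothetical helper: whisker limits from the five-number summary
iqr_fences <- function(x, k = 1.5) {
  fnum <- fivenum(x)            # min, lower hinge, median, upper hinge, max
  spread <- fnum[4] - fnum[2]   # hinge spread (close to the IQR)
  low  <- fnum[2] - k * spread
  high <- fnum[4] + k * spread
  # Positions of the observations that fall outside the limits
  list(low = low, high = high, outliers = which(x < low | x > high))
}

# Illustrative data only
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 20)
res <- iqr_fences(x)
res
```

Applied to this post's data, iqr_fences(yj_t$wage)$outliers would give the row positions in the test set to examine.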
Histogram
A histogram shows the frequency distribution of the data. Its main purpose here is to give a visual idea of the skewness, range, and distribution of values of each numeric variable. Here we look at the histogram for wage.
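A minimal sketch with base R's hist() follows. The simulated sample with one far value is an assumption made so the snippet runs on its own; for this post, x would be a column such as yj_t$wage.

```r
# Illustrative sample: a bell-shaped bulk plus one planted far value
set.seed(7)
x <- c(rnorm(200, mean = 2, sd = 0.5), 6.5)

# hist() draws the plot and returns the bin midpoints and counts
h <- hist(x, breaks = 30, main = "Histogram of x", xlab = "x")

# Isolated, sparsely populated bins far from the bulk hint at candidate outliers
data.frame(mid = h$mids, count = h$counts)[h$counts > 0, ]
```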
Z-Score
A z-score measures how far an observation lies from the mean of a group of observations, in units of the group's standard deviation: z = (x - mean) / sd. Usually, an observation with a z-score outside (-3, +3) is considered an outlier. The outliers found for wage using the z-score are as follows:
rows_w <- c()
dfk <- scale(yj_t$wage)
for (i in seq_along(dfk)) {
  if (dfk[i] > 3 || dfk[i] < -3) {
    rows_w <- c(rows_w, i)
  }
}
te[rows_w, ]
## nr year school exper union ethn maried health wage
## 1327 2868 1986 13 9 no other yes no 3.3975260
## 1734 3525 1985 14 6 no other no no 3.4493634
## 2899 6987 1982 15 4 no other yes no 3.0858335
## 3051 7784 1982 11 3 no other no no 3.4727067
## 3053 7784 1984 11 5 no other no no 3.7774921
## 3054 7784 1985 11 6 no other no no 4.0518600
## 3056 7784 1987 11 8 no other no no 3.0963043
## 3134 8090 1985 14 7 no other yes no 3.0114436
## 3135 8090 1986 14 8 no other yes no 3.0992747
## 3136 8090 1987 14 9 no other yes no 3.0966616
## 3844 9859 1983 12 7 yes hisp no no -0.7909872
## 4297 12410 1980 12 3 yes other no no -0.8153648
## industry occupation
## 1327 Trade Managers, Officials_and_Proprietors
## 1734 Professional_and_Related Service Sales_Workers
## 2899 Trade Managers, Officials_and_Proprietors
## 3051 Business_and_Repair_Service Sales_Workers
## 3053 Finance Sales_Workers
## 3054 Finance Sales_Workers
## 3056 Finance Sales_Workers
## 3134 Finance Sales_Workers
## 3135 Finance Sales_Workers
## 3136 Finance Managers, Officials_and_Proprietors
## 3844 Public_Administration Laborers_and_farmers
## 4297 Manufacturing Operatives_and_kindred
## residence
## 1327 north_east
## 1734 south
## 2899 north_east
## 3051 south
## 3053 south
## 3054 south
## 3056 south
## 3134 <NA>
## 3135 <NA>
## 3136 <NA>
## 3844 <NA>
## 4297 <NA>
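The same rule can be written without a loop. The sketch below is a vectorized alternative to the code above, not the code that produced this output; z_outliers is a hypothetical helper name and the toy vector is illustrative.

```r
# Hypothetical helper: positions whose |z-score| exceeds a cutoff
z_outliers <- function(x, cutoff = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  which(abs(z) > cutoff)
}

# Illustrative data: twenty identical values and one far value
x <- c(rep(10, 20), 100)
z_outliers(x)
```

With the objects from this post, te[z_outliers(yj_t$wage), ] should select the same kind of rows as the loop.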
Bivariate Outlier Detection
From here on, we will see how to identify outliers when two or more variables are considered together.
Scatter Plot
From a scatter plot, one can manually identify outliers by considering two variables together. Let’s plot the variables school and exper:
h_dat <- paste("Obs No :", rownames(te))
sc <-
  plot_ly(
    data = yj_t,
    x = ~school,
    y = ~exper,
    type = "scatter",
    mode = "markers",
    # hoverinfo (not hover_data) is the plotly attribute that controls hover content
    hoverinfo = "text",
    text = h_dat
  )
# Axis titles match the mapped variables: school on x, exper on y
layout(sc,
  xaxis = list(title = "school"),
  yaxis = list(title = "exper")
)
Bag Plot
A bag plot is a bivariate generalization of the box plot that can find outliers in two-dimensional data. Using bagplot() from the aplpack package, we can visualize and extract outliers for a selected pair of variables.
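A rough sketch of how this might look is given below. It assumes the aplpack package is installed and that the object returned by bagplot() carries the flagged points in its pxy.outlier component; the simulated data with one planted far point is illustrative only.

```r
library(aplpack)

# Illustrative data: a round cloud of points plus one planted far point
set.seed(7)
x <- c(rnorm(100), 8)
y <- c(rnorm(100), 8)

# Draw the bag plot; the returned list holds the bag, the loop and the outliers
bp <- bagplot(x, y, factor = 3, xlab = "x", ylab = "y")

# Coordinates of the points lying outside the loop (assumed component name)
bp$pxy.outlier
```

For the data in this post, x and y could be a pair of columns such as yj_t$school and yj_t$exper.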