Sometimes we have some problems with missing values in real-life datasets. Sometimes missing values is a real problem for research. How to recover missing values in our datasets? In this post we will consider one very interesting method of recovering missing values. It is method a bagging of regression trees. It provides the recovery of missing values for several variables at once, based on regression dependencies.
For bagging of regression trees we will use caret package for R programming language. As learning dataset we use file with results air pollution monitoring on one of monitoring station in Kryvyi Rih city (Ukraine).
For writing this post I used R version 3.3.3 and caret package version 6.0-76.
So! Let’s go!
First at all we need to install and download caret package to the work environment:
Secondly we must download learning dataset. Its name is PSZ72014.csv. After this we download dataset into the work environment and use function summary for observation:
PSZ72014 <- read.csv("PSZ72014.csv", header = T, sep = ",", dec = ".") summary(PSZ72014)
And now we look our data. Our dataset contain 13 columns and 1076 rows. What do we see? We see that some variables have only several missing values. But two variables such as CH2O (formaldehyde) and CO (carbon monooxyde) have 513 missing values both of them. Also two variables such as SO2 (sulphur dioxide) and NO2 (nitric dioxide) have 280 missing values. We will try to recover all missing values in our dataset.
Firstly we have to do data preparation for regression trees bagging modeling:
PSZ72014.pre <- preProcess(PSZ72014[, c(2:13)], method = "bagImpute")
After data preparation we make a secondary dataset. In this dataset we will write predicted values.
PSZ72014.bagImpute <- PSZ72014 PSZ72014.bagImpute[, c(2:13)] <- predict(PSZ72014.pre, PSZ72014.bagImpute[, c(2:13)])
After predicting all non-missing values are stored!
But we have a problem! How can we validate our new predicted values? One of the simpliest way is comparison a central moments for variables of both datasets. We will make two matrixes with values of central moments and compare these values.
tmp <- do.call(cbind, lapply(PSZ72014[, c(2:13)], summary)) tmp2 <- do.call(cbind, lapply(PSZ72014.bagImpute[, c(2:13)], summary)) tmp[1:6,] - tmp2
So. We have a new dataset without NA-values. And we can find different patterns in previously missed time intervals.
PS The second version of this post publicated on RPubs.
In this post we have looked the processing of raw data and inserting the missed values. The archive with raw data and program code you can download after paying of 1 USD.
File: badimput_r.7z size: 32.1 kb
The archive contains three files:
- Raw data with air pollution observation (file “PSZ72014.csv”);
- R-session file with downloaded row data (file “BagImput.RData”);
- RMarkdown file with programming code for data manipulation and inserting (calculating) missing values