<- read.csv("asml-trades.csv", skip = 3) df
13 Data Cleaning
Often when we get a dataset we need to analyze, it’s not always exactly structured and formatted in a way that we can immediately work with it. Here are some examples of this:
- The data don’t start at the top of the file because the first few rows contain some other information.
- The dates are not formatted correctly.
- Numbers are interpreted as characters.
- The data contain extra columns that we don’t want.
- There are rows with missing data that we want to omit.
- The variable names are not what we want them to be.
In these cases we need to clean the data before we can work with it. By cleaning we don’t mean modifying the underlying data. It just means bringing the dataset into a format such that we can more easily work with it in R.
Some datasets are “dirtier” than others, sometimes so “dirty” that it can take weeks or even months to clean. Fortunately we will stick to “lightly unkempt” data for this course that can be cleaned with only a few lines of code.
In this chapter we will learn some basic data cleaning techniques to deal with the 6 example issues listed above. We will do this using stock price data for the company ASML from the Amsterdam Stock Exchange.
Download the following file: asml-trades.csv and put it in your project folder. The variable names and meanings are:
Date
: The date the data from that row are from.Open
: The opening price of the stock on that day.High
: The highest price the stock traded at on that day.Low
: The lowest price the stock traded at on that day.Last
: The price of the last-traded stock at on that day.Close
: The closing price of the stock on that day.Number.of.Shares
: The number of shares traded that day.Number.of.Trades
: The number of trades made that day.
13.1 Skipping Rows
Before reading the dataset into R, let’s take a quick look at the text in the file itself. In the Files tab in RStudio, click on the asml-trades.csv
file and click “View File”. The text of the file opens in RStudio. The first few lines should look like:
Historical Data,,,,,,,,
From 2021-08-31 to 2023-08-30,,,,,,,,
NL0010273215,,,,,,,,
Date,Open,High,Low,Last,Close,Number of Shares,Number of Trades,Print table
31/8/2021,713.8,717.7,699.9,704.5,704.5,978555,36599,
1/9/2021,712.4,715.5,706.4,714.2,714.2,674506,38527,
2/9/2021,713.9,723.4,708.4,722.2,722.2,533840,29229,
We immediately notice that the first 3 rows contain information about the data that we don’t want to include in our dataframe. The variable names are on line 4 instead of line 1.
One option would be to delete those rows and save the file. However, it is bad practice to directly edit a data file. One should always work with the raw CSV file as it was downloaded. This makes it easier to reproduce your work and show your steps through your R script. Fortunately the read.csv()
function has an option to skip rows. We can use this with:
We could also have read in the data directly from the URL with:
<- read.csv("https://walshc.github.io/ebi-prog/asml-trades.csv", skip = 3) df
This way we wouldn’t have to set up an RStudio project or change the working directory.
Let’s take a first look at the data with summary()
:
summary(df)
Date Open High Low
Length:521 Min. :394.7 Min. :408.2 Min. :375.8
Class :character 1st Qu.:535.5 1st Qu.:545.5 1st Qu.:525.8
Mode :character Median :592.1 Median :597.4 Median :582.5
Mean :589.2 Mean :597.5 Mean :579.6
3rd Qu.:645.5 3rd Qu.:652.5 3rd Qu.:636.1
Max. :770.5 Max. :777.5 Max. :764.2
NA's :6 NA's :6 NA's :6
Last Close Number.of.Shares Number.of.Trades
Min. :397.4 Min. :397.4 Length:521 Length:521
1st Qu.:535.9 1st Qu.:535.9 Class :character Class :character
Median :589.4 Median :589.4 Mode :character Mode :character
Mean :588.4 Mean :588.4
3rd Qu.:644.0 3rd Qu.:644.0
Max. :770.5 Max. :770.5
NA's :6 NA's :6
Print.table
Mode:logical
NA's:521
From this we can see a few problems:
- From the summary of the
Date
variable, we can see that R read it in as a character. It did not recognize that it was a date. - The variables
Open
,High
,Low
,Last
andClose
contain 6NA
s.NA
stands for “Not Available” and is what R uses to represent missing values. - For the variables
Number.of.Shares
andNumber.of.Trades
, we can see that they were read in as characters instead of numbers. Print.table
has 521NA
s (all values areNA
), thus this variable is useless and should be deleted.- We also want to change some of the variable names and also change the names to lower case and replace the dots (
.
) with underscores (_
).
We will work through these problems for the rest of this chapter.
13.2 Formatting Dates
13.2.1 Converting Dates in the ASML Example
We will start by converting the date variable from a character to a date variable. This is useful for doing operations with the date (such as subsetting the data on observations before/after a particular date) and for plotting.
We can convert the character to a date using the as.Date()
function. The first argument of the function is the vector of dates that need to be converted, and the format
argument specifies the format the date is written in. Let’s see what format the dates are written in using the head()
command to see the first few rows:
head(df$Date)
[1] "31/8/2021" "1/9/2021" "2/9/2021" "3/9/2021" "6/9/2021" "7/9/2021"
When we look at the data, we can see that the format is dd/mm/yyyy. To specify this we need to write format = "%d/%m/%Y"
.
$Date <- as.Date(df$Date, format = "%d/%m/%Y") df
We can check what this did with the head()
command again:
head(df$Date)
[1] "2021-08-31" "2021-09-01" "2021-09-02" "2021-09-03" "2021-09-06"
[6] "2021-09-07"
We can see that when R has dates correctly formatted they show up in the format yyyy-mm-dd. This is the default format.
Now that the date is formatted correctly, the summary()
command shows the first and last dates in the data (31 August 2021 and 29 August 2023):
summary(df$Date)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2021-08-31" "2022-03-01" "2022-08-30" "2022-08-29" "2023-02-28" "2023-08-29"
13.2.2 Converting Dates from Other Formats
Dates in the Netherlands are typically written like dd-mm-yyyy. If it was the case we would instead do "%d-%m-%Y"
. You can try this out with:
as.Date("1-9-2023", format = "%d-%m-%Y")
[1] "2023-09-01"
If the dates were in mm/dd/yyyy format, as is typical in USA, we would do "%m/%d/%Y"
.
as.Date("12/31/2022", format = "%m/%d/%Y")
[1] "2022-12-31"
Sometimes we omit the century from years. We might write the 1st of September 2023 as 01/09/23. In R we need to use %y
instead of %Y
for these abbreviated years:1
as.Date("01/09/22", format = "%d/%m/%y")
[1] "2022-09-01"
13.2.3 Converting Dates with Month Names (Optional)
Sometimes the dates have the month name in words. To convert this we need to use the %b
option for abbreviated month names and %B
for full month names:
as.Date("Sep 1 2023", format = "%b %d %Y")
[1] "2023-09-01"
as.Date("January 1 2023", format = "%B %d %Y")
[1] "2023-01-01"
Note that if your computer is not in English it might not work as R expects to read months in the language of your computer. To get around this, you can first set the language for dates to English using the Sys.setlocale()
function before formatting the dates. Because of the complications with month names in different languages, I will not ask questions on the assignments or exam involving this. I am just providing this information in case it may be useful for you later.
13.3 Converting Characters to Numbers
For the variables Number.of.Shares
and Number.of.Trades
, we saw that they were read in as characters instead of numbers. Usually this happens when there are some letters or other non-numeric symbols (like the % symbol) somewhere in the data. This is because all elements of vectors in R (the individual columns of a data.frame
) must have the same data type. If there are any character elements in a vector, the remaining elements are coerced into characters.
If we look through the data we can see that some rows have "None"
written instead of NA
.2
To do this conversion we first replace the "None"
values with NA
. We do this by assigning NA
to the subset of values where df$Number.of.Shares == "None"
:3
$Number.of.Shares[df$Number.of.Shares == "None"] <- NA df
Once we have done this we use the as.numeric()
function to convert the values from character
to numeric
:
$Number.of.Shares <- as.numeric(df$Number.of.Shares) df
Now when we summarize we see that it’s treated as a number:
summary(df$Number.of.Shares)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
84141 571178 711067 783291 916000 2932273 6
We can do the same with the Number.of.Trades
. We are also able to do this skipping the step of converting the "None"
to NA
:
$Number.of.Trades <- as.numeric(df$Number.of.Trades) df
Warning: NAs introduced by coercion
When we do this, however, we see that R warned us that it converted some observations to NA
. A warning is different to an error in that R still completes the operation (in an error it will just stop). But it is warning us because we may not have expected some values to be forced to NA
. In general it is better to code in a way that doesn’t generate warnings, so I recommend setting the non-numeric values to NA
first.
13.4 Deleting columns
The last column of the data, Print.table
, contains no data. It has NA
for all rows.
To delete a variable we write over the variable with NULL
. This essentially replaces it with nothing:
$Print.table <- NULL df
An alternative approach would be to use the column index of the variable we want to drop. Print.table
is the 9th column in the data, and to drop the 9th column we could use:
<- df[, -9] df
If we view the data in RStudio, we can see that it is now deleted.
13.5 Dropping rows with missing data
Let’s take a look at how our data look now:
summary(df)
Date Open High Low
Min. :2021-08-31 Min. :394.7 Min. :408.2 Min. :375.8
1st Qu.:2022-03-01 1st Qu.:535.5 1st Qu.:545.5 1st Qu.:525.8
Median :2022-08-30 Median :592.1 Median :597.4 Median :582.5
Mean :2022-08-29 Mean :589.2 Mean :597.5 Mean :579.6
3rd Qu.:2023-02-28 3rd Qu.:645.5 3rd Qu.:652.5 3rd Qu.:636.1
Max. :2023-08-29 Max. :770.5 Max. :777.5 Max. :764.2
NA's :6 NA's :6 NA's :6
Last Close Number.of.Shares Number.of.Trades
Min. :397.4 Min. :397.4 Min. : 84141 Min. : 4398
1st Qu.:535.9 1st Qu.:535.9 1st Qu.: 571178 1st Qu.: 26290
Median :589.4 Median :589.4 Median : 711067 Median : 33375
Mean :588.4 Mean :588.4 Mean : 783291 Mean : 35830
3rd Qu.:644.0 3rd Qu.:644.0 3rd Qu.: 916000 3rd Qu.: 42707
Max. :770.5 Max. :770.5 Max. :2932273 Max. :108957
NA's :6 NA's :6 NA's :6 NA's :6
We can see that there are 6 NA
s for all variables except the Date
variable. If we scroll through the data we notice that 6 rows with NA
s are the same for all variables. Let’s have a look at what dates these are. We can do this using the is.na()
function. This function, when applied to a vector, returns TRUE
if the element is NA
and FALSE
if not. To see the dates when the variables are missing, we can do:
$Date[is.na(df$Open)] df
[1] "2022-04-15" "2022-04-18" "2022-12-26" "2023-04-07" "2023-04-10"
[6] "2023-05-01"
We can see that the missings were:
- 2022-04-15: Good Friday
- 2022-04-18: Easter Monday (Tweede paasdag)
- 2022-12-26: Day after Christmas (Tweede kerstdag)
- 2023-04-07: Good Friday
- 2023-04-10: Easter Monday (Tweede paasdag)
- 2023-05-01: Labor Day (Dag van de Arbeid)
These are weekdays where the Amsterdam stock market is closed. We can drop rows with any missing values using the na.omit()
function. I will print the number of rows in df
before and after the operation to show what is happening:
nrow(df)
[1] 521
<- na.omit(df)
df nrow(df)
[1] 515
We can see that we fell from 521 observations to 515 after deleting the 6 holidays from the data.
13.6 Renaming Variables
Although the variable names are quite okay, suppose we wanted to change some of them.
Suppose we wanted to change the name of "Number.of.Shares"
to "num_shares"
to make it shorter to type, and to replace the dot with an underscore. Because we know it occupies the 7th column, we can change the name with:
names(df)[7] <- "num_shares"
Suppose we also wanted to change the name of "Number.of.Trades"
to "num_trades"
. Counting columns like we did above increases our chances of making a mistake (besides, we need to do a lot of counting). We can instead change the name using the old name as follows:
names(df)[names(df) == "Number.of.Trades"] <- "num_trades"
How this works is names(df) == "Number.of.Trades"
is TRUE
only in the 8th column when the name is actually "Number.of.Trades"
, and so it changes the name of only that column.
We can also change the names of multiple columns at the same time. Suppose we wanted to change Open, High, Low and Last to lower case. We can do:
names(df)[2:5] <- c("open", "high", "low", "last")
If we wanted to quickly change all variable names to lower case, we can use the tolower()
function. The tolower()
function converts upper case characters to lower case:
<- c("hello!", "HELLO!", "Hello!", "HeLlO!")
test tolower(test)
[1] "hello!" "hello!" "hello!" "hello!"
Let’s use it to change the names of the data set:
names(df) <- tolower(names(df))
names(df)
[1] "date" "open" "high" "low" "last"
[6] "close" "num_shares" "num_trades"
If the date is
01/01/69
, the format%d/%m/%y
will interpret it as January 1 1969. But if the date is01/01/68
, it will interpret it as January 1 2068. All short-format years after 69 are put in the 1900s and all short-format years before 69 are put in the 2000s. You don’t need to remember these details for the exam though because we won’t ever use dates outside of 1969-2068.↩︎As a shortcut, we could have found all the non-numeric characters in a variable using the following command:
unique(grep("[^0-9]", df$Number.of.Trades, value = TRUE))
. You don’t need to remember this command for the exam.↩︎We could also have replaced all elements with non-numeric characters without knowing what they are with the following:
df$Number.of.Shares[grepl("[^0-9]", df$Number.of.Trades)] <- NA
. You don’t need to remember this command for the exam.↩︎