How does the Holt-Winters function in R work?

A trend is the long-run increase or decrease of the data over time; it can be described by a straight line (a linear trend) or by a curve (a nonlinear trend). A trend can be modeled by fitting an appropriate regression between the variable to be forecast (Y) and time (t), and that regression can then be used to generate forecast values for the future.

Time series forecasting is a great way to predict future events when we only have historical data to guide us. We can use it to predict things like the weather, stock markets, or even ice cream consumption. In this article I’ll guide you through time series setup, creating fits to the data, predicting the future, and model evaluation using the ubiquitous Holt-Winters forecasting.

For this example, we’ll be using a dataset that tracks candy production in the US by month, available on Kaggle. With Halloween and Christmas towards the end of the year, followed closely by those New Year’s resolutions, it’s not surprising that candy consumption and production have very strong seasonal components. However, the beauty of the time series forecasting shown here is that it can isolate seasonal components to reveal year-over-year trends that could indicate metrics like annual growth rates.

Data Setup

First, let’s import this data set and take a quick look at what we have:

df <- read.csv('Data/candy_production.csv', header=TRUE)
head(df)

Here we see two columns — we have a “date-like” observation_date column with values like 1972-01-01, and a second column with the candy production value. Simple enough. As we don’t actually have a date-type column, it’s never a bad idea to add one (though it’s not required for performing time series analysis!). If we ever want to filter or subset the data, it certainly makes life a lot easier. To do this, we look at the date values to work out the format they’re in. In this case it’s year-month-day (YYYY-MM-DD). Here’s a quick cheat sheet for the conversion so you don’t have to scour the internet:

Symbol Meaning Example
%d day as a number (01-31) 01-31
%a abbreviated weekday Mon
%A unabbreviated weekday Monday
%m month as a number (01-12) 01-12
%b abbreviated month Jan
%B unabbreviated month January
%y 2-digit year 07
%Y 4-digit year 2007

So for us, let’s create a new column that is of type “Date” like this:

df$Date <- as.Date(df$observation_date, "%Y-%m-%d")

Should we get into the case where our time series uses hours/minutes/seconds or time zones, we’d need to convert into a POSIXct, but that’s not needed here.
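For reference, such a conversion would look like this (a minimal sketch; the timestamp string here is made up for illustration):

```r
# Hypothetical timestamp string with a time-of-day component
stamp <- "2017-08-01 13:45:30"
# %H:%M:%S parses hours/minutes/seconds; tz pins the time zone
as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```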

Converting to a time series

This is a very simple step, but also one of the easiest to forget. Before we do anything regarding forecasting, we need to tell R that this data is a time series. To do this, we make a time series object. There are two critical inputs we must give the function — frequency and start.

First, we should accept that the function that creates a time series object isn’t smart. Despite our data set having very readable date-like values, we’ll need to manually tell it the frequency of our readings — that is, how many rows of data we have in a year. In this case, we have one data point per month, so frequency=12. For start, if we’re not interested in attaching specific dates to our results, we can just enter “1”, but since we usually are, we’ll indicate the year and month of the first data point as start=c(1972,1).

dfts <- ts(df$IPG3113N, frequency=12, start=c(1972,1))
dfts

Inspecting dfts shows how the data is now stacked in years and months.

Decomposing

Immediately we can “decompose” the time series — which in this case means separating out the 3 main components that make up the time series:

  • trend: the long-term trends in the data
  • seasonal: the repeated seasonal signal added on top of the trend
  • random: the “left-over” components that aren’t expected from the seasonality or trend components.

We can easily extract these components and plot them with:

components_dfts <- decompose(dfts)
plot(components_dfts)

As we look at the decomposition components, we can visually see how they can add up to our “observed” value (our real values). It’s also very important to inspect the scales of each section to see which one is more dominant. For example, if “random” has a range significantly larger than seasonal or trend, this data is going to be very challenging to accurately forecast later on.
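As a quick sanity check (not part of the original walkthrough; it uses R’s built-in monthly AirPassengers series since the candy CSV isn’t bundled here), we can confirm numerically that the additive components really do sum back to the observed series wherever they’re defined:

```r
# decompose() with the default additive model: observed = trend + seasonal + random
comps <- decompose(AirPassengers)
recon <- comps$trend + comps$seasonal + comps$random
# The trend is a centered moving average, so it is NA for half a season
# at each end; ignore those and check the largest reconstruction error
max(abs(comps$x - recon), na.rm = TRUE)
```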

Fitting with Holt-Winters

Before we predict values in the future, we’ll need to create a fit to the data. In the most basic method, we can simply call the Holt-Winters function and let R figure out the tuning parameters on its own. We also have the opportunity to tune the fit manually by setting tuning variables:

  • alpha: the “base value”. Higher alpha puts more weight on the most recent observations.
  • beta: the “trend value”. Higher beta means the trend slope is more dependent on recent trend slopes.
  • gamma: the “seasonal component”. Higher gamma puts more weighting on the most recent seasonal cycles.

Here we can make two fits, and plot them versus the raw data to see the quality of the fits. I highly recommend playing with the alpha, beta, and gamma values to see how the fit changes.

HW1 <- HoltWinters(dfts)
# Custom HoltWinters fitting
HW2 <- HoltWinters(dfts, alpha=0.2, beta=0.1, gamma=0.1)

Visually evaluate the fits

plot(dfts, ylab="candy production", xlim=c(2013,2018))
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW2$fitted[,1], lty=2, col="red")

HoltWinters fits to our data

Predictions

Both fits look to follow our data quite well, so now it’s time to see how they do predicting future candy production. Using the predict function, we’ll need to specify how many data points we’ll predict into the future. Here, we’ll use a value of 24 to project 2 years into the future (remember, this is a monthly time series). We’d also like some concept of the “error bars” associated with the prediction to give us an idea of our prediction confidence. To do this, we set prediction.interval = TRUE and level to the confidence interval (0.95 selected here). Once again, we’ll plot showing the tail of our existing data and the new predictions:

HW1.pred <- predict(HW1, 24, prediction.interval = TRUE, level=0.95)

Visually evaluate the prediction

plot(dfts, ylab="candy production", xlim=c(2008.5,2020))
lines(HW1$fitted[,1], lty=2, col="blue")
lines(HW1.pred[,1], col="red")
lines(HW1.pred[,2], lty=2, col="orange")
lines(HW1.pred[,3], lty=2, col="orange")

Predictions for our HW1 seasonally-additive fit

Seasonality Tuning

When we make fits, we also have the option to tune the behavior of the seasonality component. The standard Holt-Winters fit uses additive seasonality — it assumes that the amplitude of any seasonal component is relatively constant throughout the series. However, if we use multiplicative seasonality, we allow the seasonal variations (amplitude) to grow with the overall level of the data. To see how that works in our candy production case, we’ll create a new fit, predict into the future, and compare to our additive fit of HW1.

HW3 <- HoltWinters(dfts, seasonal = "multiplicative")
HW3.pred <- predict(HW3, 24, prediction.interval = TRUE, level=0.95)
plot(dfts, ylab="candy production", xlim=c(2008.5,2020))
lines(HW3$fitted[,1], lty=2, col="blue")
lines(HW3.pred[,1], col="red")
lines(HW3.pred[,2], lty=2, col="orange")
lines(HW3.pred[,3], lty=2, col="orange")

Predictions for our HW3 seasonality-multiplicative fit

As we can see, the prediction looks pretty similar to our HW1 result, but the confidence intervals spread wildly outward. For this data set, multiplicative fitting doesn’t appear to be the way to go.
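One way to put a number on that visual comparison (an extra check, not from the original article) is the sum of squared errors that HoltWinters() stores on each fitted object. Sketched here with R’s built-in AirPassengers series, whose seasonal swings grow with the level, so the multiplicative fit tends to do better there than it did on the candy data:

```r
# Compare in-sample fit quality of additive vs multiplicative seasonality.
# AirPassengers stands in for dfts; lower SSE means a closer fit.
fit_add  <- HoltWinters(AirPassengers, seasonal = "additive")
fit_mult <- HoltWinters(AirPassengers, seasonal = "multiplicative")
c(additive = fit_add$SSE, multiplicative = fit_mult$SSE)
```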

The forecast library

Those new to time series forecasting may find it’s common to see examples across the web that use forecast instead of predict to model future values. So what’s the difference? Put simply, forecast is a wrapper for predict that allows for more confidence intervals, makes plotting easier, and gives us tools to evaluate the quality of our predictions. Using our HW1 Holt-Winters fit from before, we can use forecast to make new predictions and include both 80% and 95% confidence intervals. Here’s how we do it:

library(forecast)
HW1_for <- forecast(HW1, h=24, level=c(80,95))

visualize our predictions:

plot(HW1_for, xlim=c(2008.5, 2020))
lines(HW1_for$fitted, lty=2, col="purple")

Forecast Evaluation

forecast also calculates the quality of our predictions by compiling the observed values minus the predicted values for each data point. These are added to our forecast model as $residuals. To best evaluate the smoothing functions we used in our model, we want to check that there are no correlations between forecast errors. Simply put, if neighboring points in our fits continually miss the observed values in a similar fashion, our main fit line is not reactive enough to the changes in the data. To capture this, we use the acf function to evaluate the correlation of fit residuals between points of various temporal separations in the time series (lag). Ideally, for non-zero lag, the ACF bars are within the blue range bars shown below. It’s important to use na.action=na.pass because the last value of $residuals is always NA, and the function will otherwise error out.

A Ljung-Box test can also indicate the presence of these correlations. A p-value > 0.05 means we can’t reject (at the 95% confidence level) the null hypothesis that the residuals are independent. Finally, it’s useful to check the histogram of the residuals to ensure a roughly normal distribution. If the residuals are heavily skewed, then our model may be consistently overshooting in one direction.

acf(HW1_for$residuals, lag.max=20, na.action=na.pass)
Box.test(HW1_for$residuals, lag=20, type="Ljung-Box")
hist(HW1_for$residuals)

Evaluating $residuals
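The residual checks above are all in-sample. A complementary evaluation (not covered in the original article, and sketched with the built-in AirPassengers series since the candy CSV isn’t bundled here) is to hold out the last stretch of data, forecast it, and score the point forecasts with forecast’s accuracy function:

```r
library(forecast)

# Hold out the final 24 months of the monthly series as a test set
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))

# Fit on the training window only, then forecast the held-out span
fit <- HoltWinters(train)
fc  <- forecast(fit, h = length(test))

# Reports RMSE, MAE, MAPE, etc. for both the training and test sets
accuracy(fc, test)
```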

Conclusions

As you can see, Holt-Winters forecasting is a powerful tool for predicting future data in a time series. Many other time series forecasting models exist (ARIMA, TBATS, etc), but almost all follow this same basic code sequence.

You may have also noticed that the code embedded in this article makes a great template for forecasting just about any data — just keep an eye on frequency and start when converting into a time series, and you should be all set! To save you some time and hassle, the source code can be found on my github:
