Predicting Bike rentals

Two competing regressions and some machine learning

counts

bike rentals

machine learning

Author

BonganI Ncube

Published

6 February 2024

‍> Citation: The data used in this exercise is derived from Capital Bikeshare and is used in accordance with the published license agreement.

# Load the data into the R session
bike_data <- read_csv("bike_data.csv")  
  

# View first few rows
bike_data %>% 
  slice_head(n = 7)

# A tibble: 7 × 15
   ...1 instant dteday  season    yr  mnth holiday weekday workingday weathersit
  <dbl>   <dbl> <chr>    <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
1     1       1 1/1/20…      1     0     1       0       6          0          2
2     2       2 1/2/20…      1     0     1       0       0          0          2
3     3       3 1/3/20…      1     0     1       0       1          1          1
4     4       4 1/4/20…      1     0     1       0       2          1          1
5     5       5 1/5/20…      1     0     1       0       3          1          1
6     6       6 1/6/20…      1     0     1       0       4          1          1
7     7       7 1/7/20…      1     0     1       0       5          1          2
# ℹ 5 more variables: temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>,
#   rentals <dbl>

# Take a quick glance at the data
glimpse(bike_data)

Rows: 731
Columns: 15
$ ...1       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ instant    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ dteday     <chr> "1/1/2011", "1/2/2011", "1/3/2011", "1/4/2011", "1/5/2011",…
$ season     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ yr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mnth       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ weekday    <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
$ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
$ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
$ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
$ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
$ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
$ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
$ rentals    <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…

The data consists of 731 rows the following 14 columns:

instant: A unique row identifier
dteday: The date on which the data was observed - in this case, the data was collected daily; so there’s one row per date.
season: A numerically encoded value indicating the season (1-spring, 2-summer, 3-fall, 4-winter)
yr: The year of the study in which the observation was made (the study took place over two years (year 0 represents 2011, and year 1 represents 2012)
mnth: The calendar month in which the observation was made (1-January … 12-December)
holiday: A binary value indicating whether or not the observation was made on a public holiday)
weekday: The day of the week on which the observation was made (0-Sunday … 6-Saturday)
workingday: A binary value indicating whether or not the day is a working day (not a weekend or holiday)
weathersit: A categorical value indicating the weather situation (1-clear, 2-mist/cloud, 3-light rain/snow, 4-heavy rain/hail/snow/fog)
temp: The temperature in celsius (normalized)
atemp: The apparent (“feels-like”) temperature in celsius (normalized)
hum: The humidity level (normalized)
windspeed: The windspeed (normalized)
rentals: The number of bicycle rentals recorded.

In this dataset, rentals represents the label (the \(y\) value) our model must be trained to predict. The other columns are potential features (\(x\) values).

# Parse dates then extract days
bike_data <- bike_data %>%
  mutate(dteday = mdy(dteday)) %>% 
  mutate(day = day(dteday))

bike_data %>% 
  slice_head(n = 10)

# A tibble: 10 × 16
    ...1 instant dteday     season    yr  mnth holiday weekday workingday
   <dbl>   <dbl> <date>      <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>
 1     1       1 2011-01-01      1     0     1       0       6          0
 2     2       2 2011-01-02      1     0     1       0       0          0
 3     3       3 2011-01-03      1     0     1       0       1          1
 4     4       4 2011-01-04      1     0     1       0       2          1
 5     5       5 2011-01-05      1     0     1       0       3          1
 6     6       6 2011-01-06      1     0     1       0       4          1
 7     7       7 2011-01-07      1     0     1       0       5          1
 8     8       8 2011-01-08      1     0     1       0       6          0
 9     9       9 2011-01-09      1     0     1       0       0          0
10    10      10 2011-01-10      1     0     1       0       1          1
# ℹ 7 more variables: weathersit <dbl>, temp <dbl>, atemp <dbl>, hum <dbl>,
#   windspeed <dbl>, rentals <dbl>, day <int>

# load package into the R session
library(summarytools)

# Obtain summary stats for feature and label columns
bike_data %>% 
  # Select features and label
  select(c(temp, atemp, hum, windspeed, rentals)) %>% 
  # Summary stats
  descr(order = "preserve",
        stats = c('mean', 'sd', 'min', 'q1', 'med', 'q3', 'max'),
        round.digits = 6)

Descriptive Statistics  
bike_data  
N: 731  

                    temp      atemp        hum   windspeed       rentals
------------- ---------- ---------- ---------- ----------- -------------
         Mean   0.495385   0.474354   0.627894    0.190486    848.176471
      Std.Dev   0.183051   0.162961   0.142429    0.077498    686.622488
          Min   0.059130   0.079070   0.000000    0.022392      2.000000
           Q1   0.336667   0.337746   0.520000    0.134950    315.000000
       Median   0.498333   0.486733   0.626667    0.180975    713.000000
           Q3   0.655833   0.609229   0.730417    0.233221   1097.000000
          Max   0.861667   0.840896   0.972500    0.507463   3410.000000

The statistics reveal some information about the distribution of the data in each of the numeric fields, including the number of observations (there are 731 records), the mean, standard deviation, minimum and maximum values, and the quartile values (the threshold values for 25%, 50% - which is also the median, and 75% of the data).

From this, we can see that the mean number of daily rentals is around 848; but there’s a comparatively large standard deviation, indicating a lot of variance in the number of rentals per day.

library(patchwork)
# Plot a histogram
theme_set(theme_light())
hist_plt <- bike_data %>% 
  ggplot(mapping = aes(x = rentals)) + 
  geom_histogram(bins = 100, fill = "midnightblue", alpha = 0.7) +
  
  # Add lines for mean and median
  geom_vline(aes(xintercept = mean(rentals), color = 'Mean'), linetype = "dashed", size = 1.3) +
  geom_vline(aes(xintercept = median(rentals), color = 'Median'), linetype = "dashed", size = 1.3 ) +
  xlab("") +
  ylab("Frequency") +
  scale_color_manual(name = "", values = c(Mean = "red", Median = "yellow")) +
  theme(legend.position = c(0.9, 0.9), legend.background = element_blank())

# Plot a box plot
box_plt <- bike_data %>% 
  ggplot(aes(x = rentals, y = 1)) +
  geom_boxplot(fill = "#E69F00", color = "gray23", alpha = 0.7) +
    # Add titles and labels
  xlab("Rentals")+
  ylab("")


# Combine plots
(hist_plt / box_plt) +
  plot_annotation(title = 'Rental Distribution',
                  theme = theme(
                    plot.title = element_text(hjust = 0.5)))

The plots show that the number of daily rentals ranges from 0 to just over 3,400. However, the mean (and median) number of daily rentals is closer to the low end of that range, with most of the data between 0 and around 2,200 rentals. The few values above this are shown in the box plot as small circles, indicating that they are outliers - in other words, unusually high or low values beyond the typical range of most of the data.

# Create a data frame of numeric features & label
numeric_features <- bike_data %>% 
  select(c(temp, atemp, hum, windspeed, rentals))

# Pivot data to a long format
numeric_features <- numeric_features %>% 
  pivot_longer(!rentals, names_to = "features", values_to = "values") %>%
  group_by(features) %>% 
  mutate(Mean = mean(values),
         Median = median(values))


# Plot a histogram for each feature
numeric_features %>%
  ggplot() +
  geom_histogram(aes(x = values, fill = features), bins = 100, alpha = 0.7, show.legend = F) +
  facet_wrap(~ features, scales = 'free')+
  paletteer::scale_fill_paletteer_d("ggthemes::excel_Parallax") +
  
  # Add lines for mean and median
  geom_vline(aes(xintercept = Mean, color = "Mean"), linetype = "dashed", size = 1.3 ) +
  geom_vline(aes(xintercept = Median, color = "Median"), linetype = "dashed", size = 1.3 ) +
  scale_color_manual(name = "", values = c(Mean = "red", Median = "yellow"))

Note

The numeric features seem to be more normally distributed, with the mean and median nearer the middle of the range of values, coinciding with where the most commonly occurring values are.

Note: The distributions are not truly normal in the statistical sense, which would result in a smooth, symmetric “bell-curve” histogram with the mean and mode (the most common value) in the center; but they do generally indicate that most of the observations have a value somewhere near the middle.

# Create a data frame of categorical features & label
categorical_features <- bike_data %>%
  select(c(season, mnth, holiday, weekday, workingday, weathersit, day, rentals)) 

# Pivot data to a long format
categorical_features <- categorical_features %>% 
  pivot_longer(!rentals, names_to = "features", values_to = "values") %>%
  group_by(features) %>% 
  mutate(values = factor(values))


# Plot a bar plot for each feature
categorical_features %>%
  ggplot() +
  geom_bar(aes(x = values, fill = features), alpha = 0.7, show.legend = F) +
  facet_wrap(~ features, scales = 'free') +
  ggthemes::scale_fill_calc()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 90))

comments

Many of the categorical features show a more or less uniform distribution (meaning there’s roughly the same number of rows for each category). Exceptions to this include:

holiday: There are many fewer days that are holidays than days that aren’t.
workingday: There are more working days than non-working days.
weathersit: Most days are category 1 (clear), with category 2 (mist and cloud) the next most common. There are comparatively few category 3 (light rain or snow) days, and no category 4 (heavy rain, hail, or fog) days at all.

Now that we know something about the distribution of the data in our columns, we can start to look for relationships between the features and the rentals label we want to be able to predict.

For the numeric features, we can create scatter plots that show the intersection of feature and label values.

# Plot a scatter plot for each feature
numeric_features %>% 
  mutate(corr_coef = round(cor(values, rentals),3)) %>%
  mutate(features = paste(features, ' vs rentals, r = ', corr_coef, sep = '')) %>% 
  ggplot(aes(x = values, y = rentals, color = features)) +
  geom_point(alpha = 0.7, show.legend = F) +
  geom_smooth()+
  facet_wrap(~ features, scales = 'free')+
  paletteer::scale_color_paletteer_d("ggthemes::excel_Parallax")

Note

The correlation statistic, r, quantifies the apparent relationship. The correlation statistic is a value between -1 and 1 that indicates the strength of a linear relationship.

The results aren’t conclusive, but if you look closely at the scatter plots for temp and atemp, you can see a vague diagonal trend showing that higher rental counts tend to coincide with higher temperatures; and a correlation value of just over 0.5 for both of these features supports this observation. Conversely, the plots for hum and windspeed show a slightly negative correlation, indicating that there are fewer rentals on days with high humidity or windspeed.

Now let’s compare the categorical features to the label. We’ll do this by creating box plots that show the distribution of rental counts for each category.

# Plot a box plot for each feature
categorical_features %>%
  ggplot() +
  geom_boxplot(aes(x = values, y = rentals, fill = features), alpha = 0.9, show.legend = F) +
  facet_wrap(~ features, scales = 'free') +
  ggthemes::scale_fill_tableau()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 90))

The plots show some variance in the relationship between some category values and rentals. For example, there’s a clear difference in the distribution of rentals on weekends (weekday 0 or 6) and those during the working week (weekday 1 to 5). Similarly, there are notable differences for holiday and workingday categories. There’s a noticeable trend that shows different rental distributions in summer and fall months compared to spring and winter months. The weathersit category also seems to make a difference in rental distribution. The day feature we created for the day of the month shows little variation, indicating that it’s probably not predictive of the number of rentals.

Analysis of variance

options(scipen = 999)
anov_data<-bike_data |> 
  mutate(weathersit=ifelse(weathersit==1,"clear",
                           ifelse(weathersit==2,"mist/cloud",
                                  ifelse(weathersit==3,"light rain/snow","bad")))) |> 
  mutate(season=case_when(season==1~"spring",
                          season==2~"summer",
                          season==3~"fall",
                          season==4~"winter")) 

model<-aov(rentals~weathersit,data=anov_data)
summary(model)

             Df    Sum Sq  Mean Sq F value          Pr(>F)    
weathersit    2  21825551 10912776   24.65 0.0000000000439 ***
Residuals   728 322333271   442765                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note

at 5% level of significance , we can see that there is significant difference in mean rentals for each weather situation (p<0.001)
actually the boxplot entails that when its clear there are more rentals and less rentals when there is light rain and snow

which weather situations actually differ?

Ad hoc test

TukeyHSD(model)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = rentals ~ weathersit, data = anov_data)

$weathersit
                               diff        lwr       upr     p adj
light rain/snow-clear      -778.554 -1127.2142 -429.8939 0.0000006
mist/cloud-clear           -276.678  -399.8097 -153.5463 0.0000005
mist/cloud-light rain/snow  501.876   146.6629  857.0892 0.0027328

Note

its apparent that both pairs of weather types are actually significantly different to each other

Modeling data

# Select desired features and labels
bike_select <- bike_data %>% 
  select(c(season, mnth, holiday, weekday, workingday, weathersit,
           temp, atemp, hum, windspeed, rentals)) %>% 
  mutate(across(1:6, factor))

# Get a glimpse of your data
glimpse(bike_select)

Rows: 731
Columns: 11
$ season     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ mnth       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ holiday    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ weekday    <fct> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
$ workingday <fct> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
$ weathersit <fct> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
$ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
$ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
$ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
$ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
$ rentals    <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…

# Load the Tidymodels packages
library(tidymodels)

# Split 70% of the data for training and the rest for tesing
set.seed(1234)
bike_split <- bike_select %>% 
  initial_split(prop = 0.7,
  # splitting data evenly on the holiday variable
                strata = holiday)

# Extract the data in each split
bike_train <- training(bike_split)
bike_test <- testing(bike_split)

glue::glue(
  'Training Set: {nrow(bike_train)} rows
  Test Set: {nrow(bike_test)} rows')

Training Set: 511 rows
Test Set: 220 rows

This results into the following two datasets:

bike_train: subset of the dataset used to train the model.
bike_test: subset of the dataset used to validate the model.

statistical models

linear model specification

# Build a linear model specification
lm_spec <- 
  linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

# Train a linear regression model
lm_mod <- lm_spec %>% 
  fit(rentals ~ ., data = bike_train)

# Print the model object
lm_mod |> 
  pluck("fit") |> 
  summary()


Call:
stats::lm(formula = rentals ~ ., data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1172.65  -244.54   -23.08   198.72  1590.65 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   817.923    144.698   5.653         0.0000000271 ***
season2       114.092    108.707   1.050             0.294453    
season3        11.549    133.000   0.087             0.930839    
season4       -19.409    113.847  -0.170             0.864702    
mnth2          -3.148     89.909  -0.035             0.972084    
mnth3         263.326     98.680   2.668             0.007876 ** 
mnth4         322.323    149.628   2.154             0.031721 *  
mnth5         333.025    162.546   2.049             0.041021 *  
mnth6          -6.162    171.991  -0.036             0.971433    
mnth7        -102.075    191.010  -0.534             0.593315    
mnth8          57.016    186.446   0.306             0.759886    
mnth9         256.009    164.858   1.553             0.121100    
mnth10        394.514    149.193   2.644             0.008452 ** 
mnth11        227.977    142.956   1.595             0.111427    
mnth12         73.813    115.831   0.637             0.524262    
holiday1      520.167    123.344   4.217         0.0000295411 ***
weekday1     -779.924     64.952 -12.008 < 0.0000000000000002 ***
weekday2     -807.415     62.693 -12.879 < 0.0000000000000002 ***
weekday3     -837.799     66.322 -12.632 < 0.0000000000000002 ***
weekday4     -827.469     64.476 -12.834 < 0.0000000000000002 ***
weekday5     -664.329     62.547 -10.621 < 0.0000000000000002 ***
weekday6      113.438     63.090   1.798             0.072799 .  
workingday1        NA         NA      NA                   NA    
weathersit2   -74.156     47.715  -1.554             0.120804    
weathersit3  -431.901    128.040  -3.373             0.000803 ***
temp         2798.651    757.840   3.693             0.000247 ***
atemp        -722.015    781.743  -0.924             0.356158    
hum          -682.890    181.749  -3.757             0.000193 ***
windspeed   -1033.002    249.611  -4.138         0.0000412571 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 388.2 on 483 degrees of freedom
Multiple R-squared:  0.7076,    Adjusted R-squared:  0.6913 
F-statistic:  43.3 on 27 and 483 DF,  p-value: < 0.00000000000000022

So, these are the coefficients that the model learned during training.

Evaluate the Trained Model

# Predict rentals for the test set and bind it to the test_set
results <- bike_test %>% 
  bind_cols(lm_mod %>% 
    predict(new_data = bike_test) %>% 
      rename(predictions = .pred))

# Compare predictions
results %>% 
  select(c(rentals, predictions)) %>% 
  slice_head(n = 10)

# A tibble: 10 × 2
   rentals predictions
     <dbl>       <dbl>
 1     120      -104. 
 2      82       -41.7
 3      88       -52.5
 4      68       560. 
 5      54       451. 
 6     251       772. 
 7     117       281. 
 8      83      -106. 
 9      75      -141. 
10      86      -274.

Caution

this is one of the many draw backs of linear regression…(predicting negative values)

Visualise the results

results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(size = 1.6, color = "steelblue") +
  # Overlay a regression line
  geom_smooth(method = "lm", se = F, color = 'magenta') +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

Tip

🕵 📈There’s a definite diagonal trend, and the intersections of the predicted and actual values are generally following the path of the trend line; but there’s a fair amount of difference between the ideal function represented by the line and the results. This variance represents the residuals of the model - in other words, the difference between the label predicted when the model applies the coefficients it learned during training to the validation data, and the actual value of the validation label. These residuals when evaluated from the validation data indicate the expected level of error when the model is used with new data for which the label is unknown.

Note

You can quantify the residuals by calculating a number of commonly used evaluation metrics. We’ll focus on the following three:

Mean Square Error (MSE): The mean of the squared differences between predicted and actual values. This yields a relative metric in which the smaller the value, the better the fit of the model
Root Mean Square Error (RMSE): The square root of the MSE. This yields an absolute metric in the same unit as the label (in this case, numbers of rentals). The smaller the value, the better the model (in a simplistic sense, it represents the average number of rentals by which the predictions are wrong!)
Coefficient of Determination (usually known as R-squared or R2): A relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.

# Multiple regression metrics
eval_metrics <- metric_set(rmse, rsq)

# Evaluate RMSE, R2 based on the results
eval_metrics(data = results,
             truth = rentals,
             estimate = predictions) %>% 
  select(-2)

# A tibble: 2 × 2
  .metric .estimate
  <chr>       <dbl>
1 rmse      375.   
2 rsq         0.676

Note

the result is not bad , a larger rsq (67% of the variation in the response is explained by the model)and some rmse value though a bit large when expected to be small

Fitting a poisson regression

initialize a specification

pois_spec <- poisson_reg() %>% 
  set_mode("regression") %>% 
  set_engine("glm")

# Train a linear regression model
pois_mod <- pois_spec %>% 
  fit(rentals ~ ., data = bike_train)

# Print the model object
pois_mod |> 
  pluck("fit") |> 
  summary()


Call:
stats::glm(formula = rentals ~ ., family = stats::poisson, data = data)

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error  z value            Pr(>|z|)    
(Intercept)  5.760943   0.015999  360.079 <0.0000000000000002 ***
season2      0.222331   0.010273   21.642 <0.0000000000000002 ***
season3      0.123923   0.012260   10.108 <0.0000000000000002 ***
season4      0.216147   0.013405   16.125 <0.0000000000000002 ***
mnth2        0.224541   0.015095   14.875 <0.0000000000000002 ***
mnth3        0.951898   0.013650   69.736 <0.0000000000000002 ***
mnth4        0.958106   0.016912   56.652 <0.0000000000000002 ***
mnth5        0.832624   0.017835   46.685 <0.0000000000000002 ***
mnth6        0.504826   0.018778   26.884 <0.0000000000000002 ***
mnth7        0.392153   0.020154   19.458 <0.0000000000000002 ***
mnth8        0.663365   0.019563   33.909 <0.0000000000000002 ***
mnth9        0.798885   0.018513   43.152 <0.0000000000000002 ***
mnth10       0.910833   0.018860   48.294 <0.0000000000000002 ***
mnth11       0.672912   0.018665   36.052 <0.0000000000000002 ***
mnth12       0.304158   0.017118   17.768 <0.0000000000000002 ***
holiday1     0.569193   0.009461   60.165 <0.0000000000000002 ***
weekday1    -0.805677   0.005763 -139.812 <0.0000000000000002 ***
weekday2    -0.885363   0.005833 -151.794 <0.0000000000000002 ***
weekday3    -0.959943   0.006328 -151.690 <0.0000000000000002 ***
weekday4    -0.853096   0.005649 -151.014 <0.0000000000000002 ***
weekday5    -0.634930   0.005174 -122.712 <0.0000000000000002 ***
weekday6     0.099508   0.004327   22.998 <0.0000000000000002 ***
workingday1        NA         NA       NA                  NA    
weathersit2 -0.104121   0.004281  -24.320 <0.0000000000000002 ***
weathersit3 -1.440196   0.024427  -58.960 <0.0000000000000002 ***
temp         2.486540   0.059626   41.702 <0.0000000000000002 ***
atemp       -0.016302   0.061458   -0.265               0.791    
hum         -0.711811   0.015707  -45.319 <0.0000000000000002 ***
windspeed   -1.021681   0.023355  -43.745 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 265882  on 510  degrees of freedom
Residual deviance:  50935  on 483  degrees of freedom
AIC: 55192

Number of Fisher Scoring iterations: 4

# Predict rentals for the test set and bind it to the test_set
results <- bike_test %>% 
  bind_cols(pois_mod %>% 
    predict(new_data = bike_test) %>% 
      rename(predictions = .pred))

# Compare predictions
results %>% 
  select(c(rentals, predictions)) %>% 
  slice_head(n = 10)

# A tibble: 10 × 2
   rentals predictions
     <dbl>       <dbl>
 1     120        131.
 2      82        129.
 3      88        141.
 4      68        247.
 5      54        227.
 6     251        329.
 7     117        195.
 8      83        130.
 9      75        131.
10      86        108.

there you go

poisson model gives us positive predictions ,at least thats what we wanted .

# Visualise the results
results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(size = 1.6, color = "steelblue") +
  # Overlay a regression line
  geom_smooth(method = "lm", se = F, color = 'magenta') +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Multiple regression metrics
eval_metrics <- metric_set(rmse, rsq)

# Evaluate RMSE, R2 based on the results
eval_metrics(data = results,
             truth = rentals,
             estimate = predictions) %>% 
  select(-2)

# A tibble: 2 × 2
  .metric .estimate
  <chr>       <dbl>
1 rmse      351.   
2 rsq         0.714

Note

we can see that rmse significantly dropped and rsq significantly increased which is exactly what we expected to find after fitting and more suitable ,model

lasso regression

Here we’ll set up one model specification for lasso regression; I picked a value for penalty (sort of randomly) and I set mixture = 1 for lasso (When mixture = 1, it is a pure lasso model).

For convenience, we’ll rewrite the whole code from splitting data to evaluating model performance.

# Split 70% of the data for training and the rest for tesing
set.seed(1234)
bike_split <- bike_select %>% 
  initial_split(prop = 0.7, strata = holiday)

# Extract the data in each split
bike_train <- training(bike_split)
bike_test <- testing(bike_split)

# Build a lasso model specification
lasso_spec <-
  linear_reg(penalty = 1, mixture = 1) %>% 
  set_engine('glmnet') %>% 
  set_mode('regression')

# Train a lasso regression model
lasso_mod <- lasso_spec %>% 
  fit(rentals ~ ., data = bike_train)

# Make predictions for test data
results <- bike_test %>% 
  bind_cols(lasso_mod %>% predict(new_data = bike_test) %>% 
              rename(predictions = .pred))

# Evaluate the model
lasso_metrics <- eval_metrics(data = results,
                                    truth = rentals,
                                    estimate = predictions) %>%
  select(-2)


# Plot predicted vs actual
lasso_plt <- results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(size = 1.6, color = 'darkorchid') +
  # overlay regression line
  geom_smooth(method = 'lm', color = 'black', se = F) +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Return evaluations
list(lasso_metrics, lasso_plt)

[[1]]
# A tibble: 2 × 2
  .metric .estimate
  <chr>       <dbl>
1 rmse      373.   
2 rsq         0.680

[[2]]

Caution

apparently there is no much of an improvement

Machine learning Algorithms

As an alternative to a linear model, there’s a category of algorithms for machine learning that uses a tree-based approach in which the features in the dataset are examined in a series of evaluations, each of which results in a branch in a decision tree based on the feature value. At the end of each series of branches are leaf-nodes with the predicted label value based on the feature values.

# Build a decision tree specification
tree_spec <- decision_tree() %>% 
  set_engine('rpart') %>% 
  set_mode('regression')

# Train a decision tree model 
tree_mod <- tree_spec %>% 
  fit(rentals ~ ., data = bike_train)

# Print model
tree_mod

parsnip model object

n= 511 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 511 248953000.0  871.3973  
   2) temp< 0.41875 194  22958580.0  384.3763  
     4) atemp< 0.3207 105   2980283.0  233.8571 *
     5) atemp>=0.3207 89  14792870.0  561.9551  
      10) workingday=1 55   5390687.0  389.7091 *
      11) workingday=0 34   5130762.0  840.5882 *
   3) temp>=0.41875 317 151819200.0 1169.4480  
     6) workingday=1 219  19336580.0  807.8174  
      12) hum>=0.8485415 16    425373.4  308.6875 *
      13) hum< 0.8485415 203  14610940.0  847.1576 *
     7) workingday=0 98  39840680.0 1977.5820  
      14) hum>=0.8322915 7   1597155.0 1200.1430 *
      15) hum< 0.8322915 91  33687190.0 2037.3850  
        30) weekday=0,1,5 54  17912780.0 1871.3890 *
        31) weekday=3,6 37  12114860.0 2279.6490 *

So now we have a tree-based model; but is it any good? Let’s evaluate it with the test data.

# Make predictions for test data
results <- bike_test %>% 
  bind_cols(tree_mod %>% predict(new_data = bike_test) %>% 
              rename(predictions = .pred))

# Evaluate the model
tree_metrics <- eval_metrics(data = results,
                                  truth = rentals,
                                  estimate = predictions) %>% 
  select(-2)


# Plot predicted vs actual
tree_plt <- results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(color = 'tomato') +
  # overlay regression line
  geom_smooth(method = 'lm', color = 'steelblue', se = F) +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Return evaluations
list(tree_metrics, tree_plt)

[[1]]
# A tibble: 2 × 2
  .metric .estimate
  <chr>       <dbl>
1 rmse      372.   
2 rsq         0.681

[[2]]

Warning

model did not perform well as expected

Ensemble Algorithm

let’s try a Random Forest model, which applies an averaging function to multiple Decision Tree models for a better overall model.

# Build a random forest model specification
rf_spec <- rand_forest() %>% 
  set_engine('randomForest') %>% 
  set_mode('regression')

# Train a random forest model 
rf_mod <- rf_spec %>% 
  fit(rentals ~ ., data = bike_train)


# Print model
rf_mod

parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 106838.1
                    % Var explained: 78.07

So now we have a random forest model; but is it any good? Let’s evaluate it with the test data.

# Make predictions for test data
results <- bike_test %>% 
  bind_cols(rf_mod %>% predict(new_data = bike_test) %>% 
              rename(predictions = .pred))

# Evaluate the model
rf_metrics <- eval_metrics(data = results,
                                  truth = rentals,
                                  estimate = predictions) %>% 
  select(-2)


# Plot predicted vs actual
rf_plt <- results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(color = '#6CBE50FF') +
  # overlay regression line
  geom_smooth(method = 'lm', color = '#2B7FF9FF', se = F) +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Return evaluations
list(rf_metrics, rf_plt)

[[1]]
# A tibble: 2 × 2
  .metric .estimate
  <chr>       <dbl>
1 rmse      298.   
2 rsq         0.793

[[2]]

Note

the rmse significantly decreased by a greater value and \(R^2\) shot upwards
the model did well

Boosted algorithm

lets demonstrate how to implement Gradient Boosting Machines using the xgboost engine.

# Build an xgboost model specification
boost_spec <- boost_tree() %>% 
  set_engine('xgboost') %>% 
  set_mode('regression')

# Train an xgboost model 
boost_mod <- boost_spec %>% 
  fit(rentals ~ ., data = bike_train)


# Print model
boost_mod

parsnip model object

##### xgb.Booster
raw: 57.9 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
    colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
    subsample = 1), data = x$data, nrounds = 15, watchlist = x$watchlist, 
    verbose = 0, nthread = 1, objective = "reg:squarederror")
params (as set within xgb.train):
  eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 34 
niter: 15
nfeatures : 34 
evaluation_log:
    iter training_rmse
       1      816.4887
       2      608.7446
---                   
      14      111.2692
      15      103.5075

# Make predictions for test data
results <- bike_test %>% 
  bind_cols(boost_mod %>% predict(new_data = bike_test) %>% 
              rename(predictions = .pred))

# Evaluate the model
boost_metrics <- eval_metrics(data = results,
                                  truth = rentals,
                                  estimate = predictions) %>% 
  select(-2)


# Plot predicted vs actual
boost_plt <- results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(color = '#4D3161FF') +
  # overlay regression line
  geom_smooth(method = 'lm', color = 'black', se = F) +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Return evaluations
list(boost_metrics, boost_plt)

[[1]]
# A tibble: 2 × 2
  .metric .estimate
  <chr>       <dbl>
1 rmse      306.   
2 rsq         0.784

[[2]]

amazing , xgboost performed even better , rmse decreased by almost \(50\%\) which is a step in the right direction
rsq significantly increased upto the \(94\%\) showing a better improvement