Answering CS1B Actuarial Exam Questions in R

Interactive learning

Author

Bongani Ncube

Subject CS1 – Actuarial Statistics Core Principles

To test that you have successfully loaded the required R packages, run the following code in order within R:

Note
  • The exam requires you to load the dataset for each question with the function load().
  • Here, all the datasets are loaded at once.
  • Do not run the chunk below: it will not run and is shown for demonstration purposes only.
Note
  • Run the chunks below; they read in all the data needed for the exercises.
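As a minimal sketch of how load() behaves, the chunk below saves two small made-up vectors to a temporary .RData file and loads them back. The values and the temporary file are assumptions for illustration only; they are not the exam data.

```r
# Hypothetical stand-in values; the real exam file is not used here
ClaimAmount    <- c(1200, 950, 1830)
EducationYears <- c(10, 12, 16)

# Save both objects to a temporary .RData file, then remove them
tmp <- tempfile(fileext = ".RData")
save(ClaimAmount, EducationYears, file = tmp)
rm(ClaimAmount, EducationYears)

# load() restores the saved objects into the workspace and
# invisibly returns their names
loaded <- load(tmp)
print(loaded)
```

In the exam you would call load() with the file name given in the question instead of a temporary file.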
Question 1

An insurance company wants to study the association between the number of years their clients spent in education and their claim amounts. Data from 25 randomly selected claims are contained in the file AmountYears.RData in the following two variables:

  • ClaimAmount – this is the claim amount (in £).
  • EducationYears – this is the number of years the client spent in education.
(i) Plot the claim amounts against the years of education.

Use the plot() function; to learn more, type ?plot in the console.

If packages were allowed I would use ggplot2, but do not use it here.
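A base R scatter plot for part (i) might look like the sketch below. Because AmountYears.RData is not available here, the two variables are simulated; the seed, the range of years and the noise level are assumptions.

```r
set.seed(1)
# Simulated stand-in for AmountYears.RData (25 claims); not the exam data
EducationYears <- sample(8:20, 25, replace = TRUE)
ClaimAmount <- 609272 - 95999 * EducationYears + 4371 * EducationYears^2 +
  rnorm(25, sd = 15000)

# Scatter plot of claim amount against years of education
plot(EducationYears, ClaimAmount,
     xlab = "Years of education", ylab = "Claim amount (GBP)",
     main = "Claim amount against years of education", pch = 19)
```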

(ii) (a) Fit a linear model to the data.
(b) Plot the regression line by adding it to the graph in part (i).
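One way to approach part (ii) is lm() followed by abline(), as sketched below on simulated stand-in data (the data, and the object name linear_fit, are assumptions rather than the exam file or required names).

```r
set.seed(1)
# Simulated stand-in for AmountYears.RData; not the exam data
EducationYears <- sample(8:20, 25, replace = TRUE)
ClaimAmount <- 609272 - 95999 * EducationYears + 4371 * EducationYears^2 +
  rnorm(25, sd = 15000)

# (a) Fit the simple linear model
linear_fit <- lm(ClaimAmount ~ EducationYears)
print(coef(linear_fit))

# (b) Redraw the scatter plot and add the fitted regression line
plot(EducationYears, ClaimAmount, pch = 19,
     xlab = "Years of education", ylab = "Claim amount (GBP)")
abline(linear_fit, col = "red", lwd = 2)
```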
(iii) (a) Fit a model with a quadratic term added to the model fitted in part (ii).
The fitted model is:

\[ClaimAmount = 609272 - 95999 \times EducationYears + 4371 \times EducationYears^2\]
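A quadratic term can be added inside the formula with I(), as in this sketch. The data are simulated around the coefficients quoted above, so the estimates printed here will only be close to those values, not equal to them.

```r
set.seed(1)
# Simulated stand-in for AmountYears.RData; not the exam data
EducationYears <- sample(8:20, 25, replace = TRUE)
ClaimAmount <- 609272 - 95999 * EducationYears + 4371 * EducationYears^2 +
  rnorm(25, sd = 15000)

# I() protects the square so that ^ is treated as arithmetic in the formula
quad_fit <- lm(ClaimAmount ~ EducationYears + I(EducationYears^2))
print(summary(quad_fit))

# Add the fitted curve to the scatter plot
plot(EducationYears, ClaimAmount, pch = 19,
     xlab = "Years of education", ylab = "Claim amount (GBP)")
grid <- data.frame(EducationYears = seq(min(EducationYears),
                                        max(EducationYears),
                                        length.out = 100))
lines(grid$EducationYears, predict(quad_fit, newdata = grid),
      col = "blue", lwd = 2)
```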

(iv) Comment on the suitability of the quadratic model in part (iii), compared to the model in part (ii), based on the output from part (iii).

Comment

Note
  • The quadratic model seems more suitable: in the plot below, its fitted curve passes closer to more of the data points than the straight line does.
  • \(R^2\) is considerably higher for the quadratic model than for the linear model.
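The comparison in the comment can be checked numerically. The sketch below, again on simulated stand-in data, puts the adjusted \(R^2\) of both models side by side and runs an F-test on the extra quadratic term; using adjusted \(R^2\) and anova() here is my choice, not something the question prescribes.

```r
set.seed(1)
# Simulated stand-in for AmountYears.RData; not the exam data
EducationYears <- sample(8:20, 25, replace = TRUE)
ClaimAmount <- 609272 - 95999 * EducationYears + 4371 * EducationYears^2 +
  rnorm(25, sd = 15000)

linear_fit <- lm(ClaimAmount ~ EducationYears)
quad_fit   <- lm(ClaimAmount ~ EducationYears + I(EducationYears^2))

# Adjusted R-squared penalises the extra parameter of the quadratic model
r2 <- c(linear    = summary(linear_fit)$adj.r.squared,
        quadratic = summary(quad_fit)$adj.r.squared)
print(r2)

# F-test for whether the quadratic term improves on the linear model
print(anova(linear_fit, quad_fit))
```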
Question 2

A financial consultancy working with large firms wishes to model the relationship between a firm’s assets and the number of senior management positions in the firm. The data file firms.Rdata contains the variables:

  • assets – this is the value of assets (in millions of £).
  • sn_positions – this is the number of senior management positions.
(i) Plot the number of senior management positions as a function of assets.
(ii) Plot the number of senior management positions as a function of log10_assets, where log10_assets is the value of assets on a log10 scale.
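A sketch of the log10 transformation and plot follows. The firms data are simulated here: the number of firms (40) and the way sn_positions depends on assets are assumptions, since firms.Rdata is not available.

```r
set.seed(2)
# Simulated stand-in for firms.Rdata; not the exam data
assets       <- round(10^runif(40, min = 1, max = 4))
sn_positions <- rpois(40, lambda = 2 * log10(assets))

# Assets on a log10 scale
log10_assets <- log10(assets)

plot(log10_assets, sn_positions, pch = 19,
     xlab = "log10(assets in millions of GBP)",
     ylab = "Number of senior management positions")
```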

Mean

Sample size

(v) Generate a sample of size equal to the number of firms from a Poisson distribution with parameter equal to the mean calculated in part (iv).
Plot a histogram of the sample simulated in part (v) and a histogram of sn_positions on two separate graphs but on the same scale, specifying appropriate axis limits and labels.
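The simulation and the two histograms could be sketched as below. The stand-in sn_positions vector is simulated, and sharing integer-centred breaks plus common xlim and ylim is just one way to put both graphs on the same scale.

```r
set.seed(3)
# Simulated stand-in for sn_positions; not the exam data
sn_positions <- rpois(40, lambda = 5)

n_firms        <- length(sn_positions)   # sample size
mean_positions <- mean(sn_positions)     # Poisson parameter estimate

# Poisson sample of the same size, with parameter equal to the mean
pois_sample <- rpois(n_firms, lambda = mean_positions)

# Shared breaks and axis limits so both graphs use the same scale
x_max  <- max(sn_positions, pois_sample)
breaks <- seq(-0.5, x_max + 0.5, by = 1)
h_obs  <- hist(sn_positions, breaks = breaks, plot = FALSE)
h_sim  <- hist(pois_sample, breaks = breaks, plot = FALSE)
y_max  <- max(h_obs$counts, h_sim$counts)

hist(sn_positions, breaks = breaks,
     xlim = c(-0.5, x_max + 0.5), ylim = c(0, y_max),
     xlab = "Senior management positions", main = "Observed data")
hist(pois_sample, breaks = breaks,
     xlim = c(-0.5, x_max + 0.5), ylim = c(0, y_max),
     xlab = "Senior management positions", main = "Simulated Poisson sample")
```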


Question 3

A Multiple-Choice (MC) test with 20 questions requires a minimum of 16 correct answers for students to pass the test. A student prepares for the test using a mobile phone application that generates random practice tests with 20 questions per test.

Load the file MCtestResults.Rdata into R. This creates two variables:

  • CorrectOutOf20Questions – this contains the number of correct answers the student has achieved with the mobile phone application in each of 50 generated practice MC tests.
  • TrialNumber – this contains the corresponding test number from 1 to 50.

The student assumes that the test score, X, which is the number of correctly answered questions per test, has a binomial distribution,

\(X \sim Bin(n,p)\) with n=20.

(i) Estimate the parameter 𝑝 using the test scores in MCtestResults.Rdata, assuming that the test scores are independent of each other and identically distributed.

How do we find p?

  • There are 50 generated tests.
  • Each test is marked out of 20.
  • If he were to get everything correct, he would score 50 × 20 = 1000 marks.
  • What he actually scored is the sum of the marks in the column CorrectOutOf20Questions.
  • The estimate of p is therefore this sum divided by 1000.
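The steps above can be sketched in R as follows. The scores are simulated with an arbitrary success probability, so the printed estimate is not the answer for the real MCtestResults.Rdata file.

```r
set.seed(4)
# Simulated stand-in for CorrectOutOf20Questions; not the exam data
CorrectOutOf20Questions <- rbinom(50, size = 20, prob = 0.8)

# Total marks scored divided by total marks available (50 tests x 20 marks)
p_hat <- sum(CorrectOutOf20Questions) / (50 * 20)
print(p_hat)
```

The same value comes from mean(CorrectOutOf20Questions) / 20.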
(ii) Calculate the probability that the student will pass a test based on your estimate of 𝑝 in part (i).

Probability of passing the test

  • Passing requires a mark of 16 or better, i.e. \(P(X \ge 16) = P(X > 15)\).
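With pbinom() this upper-tail probability is one line. The value of p_hat below is a placeholder assumption; in the exercise it would be the estimate from part (i).

```r
# Placeholder estimate of p; use the value from part (i) in practice
p_hat <- 0.8

# P(X >= 16) = P(X > 15) = 1 - P(X <= 15) for X ~ Bin(20, p_hat)
pass_prob <- 1 - pbinom(15, size = 20, prob = p_hat)

# Two equivalent ways of writing the same probability
pass_prob_tail <- pbinom(15, size = 20, prob = p_hat, lower.tail = FALSE)
pass_prob_sum  <- sum(dbinom(16:20, size = 20, prob = p_hat))

print(pass_prob)
```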
(iii) Calculate the proportion of practice tests that the student has passed.

Proportion of tests that he passed
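The observed proportion is the mean of a logical vector marking which tests reached the pass mark of 16. The scores below are simulated stand-ins, not the exam data.

```r
set.seed(4)
# Simulated stand-in for CorrectOutOf20Questions; not the exam data
CorrectOutOf20Questions <- rbinom(50, size = 20, prob = 0.8)

# TRUE where the test was passed (16 or more correct answers)
passed <- CorrectOutOf20Questions >= 16

# mean() of a logical vector gives the proportion of TRUE values
prop_passed <- mean(passed)
print(prop_passed)
```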

(v) Plot the number of correct answers in each of the practice tests against the test number on the horizontal axis
Question 4

Consider a random variable, 𝑋, following a modified exponential distribution with Cumulative Distribution Function (CDF):

\[F(x) = 1 - \exp(-\lambda x^2)\]

(i) Plot the CDF F(x) as a function of x for x =0.1, 0.2, … ,9.9, 10 when .
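A sketch of the CDF plot follows. The value of λ is missing from the question text above, so lambda <- 0.5 is purely an assumed placeholder; substitute the value given in the exam paper.

```r
# Assumed placeholder; the question's value of lambda is not shown above
lambda <- 0.5

# Grid x = 0.1, 0.2, ..., 9.9, 10
x <- seq(0.1, 10, by = 0.1)

# CDF of the modified exponential distribution
Fx <- 1 - exp(-lambda * x^2)

plot(x, Fx, type = "l", xlab = "x", ylab = "F(x)",
     main = "CDF of the modified exponential distribution")
```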

A random sample of 100 values of X is provided in randomSample.Rdata. Loading the sample data into R will generate a vector x with 100 values representing the sample.

    1. Calculate the value of the log likelihood function for the parameter λ at the point \(\lambda =2\) based on this random sample.
    1. Plot the values of the log likelihood function for the parameter λ based on the sample in randomSample.Rdata. Your plot of the log likelihood function must be for values of λ = 0.01, 0.02, … , 0.99, 1.
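Differentiating the CDF gives the density \(f(x) = 2\lambda x \exp(-\lambda x^2)\), so the log likelihood is \(\ell(\lambda) = N\log(2\lambda) + \sum \log x_i - \lambda \sum x_i^2\). The sketch below codes this up on a simulated sample (drawn by inverting the CDF with an arbitrary true λ of 0.3); the function name loglik is my own.

```r
set.seed(5)
# Simulated stand-in for the vector x in randomSample.Rdata.
# Inverse-CDF sampling: if U ~ Unif(0,1), sqrt(-log(U) / lambda) has this CDF
true_lambda <- 0.3
x <- sqrt(-log(runif(100)) / true_lambda)

# Log likelihood for density f(x) = 2 * lambda * x * exp(-lambda * x^2)
loglik <- function(lambda, x) {
  length(x) * log(2 * lambda) + sum(log(x)) - lambda * sum(x^2)
}

# Value of the log likelihood at lambda = 2
print(loglik(2, x))

# Log likelihood over lambda = 0.01, 0.02, ..., 1
lambda_grid <- seq(0.01, 1, by = 0.01)
ll_values   <- sapply(lambda_grid, loglik, x = x)

plot(lambda_grid, ll_values, type = "l",
     xlab = "lambda", ylab = "Log likelihood")
```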

The maximum likelihood estimator for the parameter λ based on a random sample \(X_1 , ..., X_N\) is given by \[\hat{\lambda} = \frac{N}{\sum^N_{i=1}X_{i}^2}\]

(v) Estimate the value of λ using the maximum likelihood estimator given above and the sample in randomSample.Rdata.
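The estimator is a one-liner in R. The sketch below applies it to a simulated sample (not the real randomSample.Rdata) and, as a sanity check that is not part of the question, compares it with a numerical maximiser of the log likelihood.

```r
set.seed(5)
# Simulated stand-in for the vector x in randomSample.Rdata
x <- sqrt(-log(runif(100)) / 0.3)

# Maximum likelihood estimate: N / sum of squared observations
lambda_hat <- length(x) / sum(x^2)
print(lambda_hat)

# Optional check: maximise the log likelihood numerically
loglik <- function(lambda, x) {
  length(x) * log(2 * lambda) + sum(log(x)) - lambda * sum(x^2)
}
numeric_max <- optimize(loglik, interval = c(0.001, 5), x = x,
                        maximum = TRUE)$maximum
print(numeric_max)
```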
Question 5

An insurance company, which currently only sells home insurance, is interested in entering the car insurance market. An underwriting manager at the company believes that the age and gender of the policyholder will be the most important factors in estimating the number of claims made under a car insurance policy. The underwriting manager has commissioned a survey of its current home insurance customers who also have car insurance, choosing a male customer and a female customer for every age from 18 to 65, asking them how many car insurance claims they have made in the past 3 years. This dataset is saved in the file ClaimsData.Rdata. After loading this data into R, using the command load("ClaimsData.Rdata"), the data frame ClaimsData will be available, which contains the following three variables:

  • age – this is the age (in years) of the policyholder.
  • gender – this is either ‘M’ for male or ‘F’ for female.
  • claim_count – this is the number of car insurance claims reported by the policyholder over the past 3 years.
(i) Fit a normal linear regression model to the data using claim_count as the response variable and age as the explanatory variable. Your answer should include the estimated intercept and slope of the regression line.
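A sketch of the normal linear regression follows. ClaimsData is simulated here with the same layout as described (one male and one female for each age from 18 to 65); the claim counts and the coefficients used to generate them are assumptions, not the survey data.

```r
set.seed(6)
# Simulated stand-in for ClaimsData; not the exam data
ClaimsData <- data.frame(
  age    = rep(18:65, times = 2),
  gender = rep(c("M", "F"), each = 48)
)
ClaimsData$claim_count <- rpois(
  nrow(ClaimsData),
  lambda = exp(1.5 - 0.03 * ClaimsData$age + 0.3 * (ClaimsData$gender == "M"))
)

# Normal linear regression of claim_count on age
lm_fit <- lm(claim_count ~ age, data = ClaimsData)

# Estimated intercept and slope
print(coef(lm_fit))
```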
(ii) Fit a Generalised Linear Model (GLM) to the data using claim_count as the response variable and age as the explanatory variable, assuming a Poisson distribution for the response variable. Your answer should include the estimated coefficients and Akaike's Information Criterion (AIC) of the fitted model.
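The Poisson GLM can be sketched with glm() and family = poisson, which uses the log link by default. The data are again a simulated stand-in for ClaimsData.

```r
set.seed(6)
# Simulated stand-in for ClaimsData; not the exam data
ClaimsData <- data.frame(
  age    = rep(18:65, times = 2),
  gender = rep(c("M", "F"), each = 48)
)
ClaimsData$claim_count <- rpois(
  nrow(ClaimsData),
  lambda = exp(1.5 - 0.03 * ClaimsData$age + 0.3 * (ClaimsData$gender == "M"))
)

# Poisson GLM with the default log link
pois_fit <- glm(claim_count ~ age, family = poisson, data = ClaimsData)

# Estimated coefficients (on the log scale) and AIC
print(coef(pois_fit))
print(AIC(pois_fit))
```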

The underwriting manager wishes to compare the fit of the GLM in part (ii) against that of the normal linear regression model in part (i).

(iii) Fit, by choosing a suitable argument for family in the glm command, a GLM to the data that is equivalent to the model fitted in part (i). Your answer should include the estimated coefficients and the AIC of this fitted model.
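A GLM with family = gaussian (identity link) reproduces the normal linear model. The sketch below, on simulated stand-in data, fits both and prints them side by side so the equivalence can be seen.

```r
set.seed(6)
# Simulated stand-in for ClaimsData; not the exam data
ClaimsData <- data.frame(
  age    = rep(18:65, times = 2),
  gender = rep(c("M", "F"), each = 48)
)
ClaimsData$claim_count <- rpois(
  nrow(ClaimsData),
  lambda = exp(1.5 - 0.03 * ClaimsData$age + 0.3 * (ClaimsData$gender == "M"))
)

# Normal linear model and its GLM equivalent
lm_fit    <- lm(claim_count ~ age, data = ClaimsData)
gauss_fit <- glm(claim_count ~ age, family = gaussian, data = ClaimsData)

# Coefficients and AIC agree between the two fits
print(rbind(lm = coef(lm_fit), glm = coef(gauss_fit)))
print(c(lm = AIC(lm_fit), glm = AIC(gauss_fit)))
```

The AIC of this Gaussian GLM is directly comparable with the AIC of the Poisson GLM in part (ii), because both are fitted to the same response.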

The underwriting manager believes the Poisson GLM would be improved by adding the explanatory variable gender as well as its interaction with age.

(iv) Fit a Poisson GLM to the data of the form age*gender. Your answer should include the estimated coefficients and the AIC of this fitted model.
(v) Compare, using scaled deviances, the fit of this model to that in part (ii).
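The last two parts can be sketched together. For a Poisson GLM the scale parameter is 1, so the scaled deviance equals the residual deviance, and the drop in deviance between nested models is compared with a chi-squared distribution on the difference in residual degrees of freedom. The data are a simulated stand-in, so the numbers printed are not the exam answers.

```r
set.seed(6)
# Simulated stand-in for ClaimsData; not the exam data
ClaimsData <- data.frame(
  age    = rep(18:65, times = 2),
  gender = rep(c("M", "F"), each = 48)
)
ClaimsData$claim_count <- rpois(
  nrow(ClaimsData),
  lambda = exp(1.5 - 0.03 * ClaimsData$age + 0.3 * (ClaimsData$gender == "M"))
)

# Model from part (ii) and the model with gender and its interaction with age
fit_age <- glm(claim_count ~ age, family = poisson, data = ClaimsData)
fit_int <- glm(claim_count ~ age * gender, family = poisson, data = ClaimsData)

print(coef(fit_int))
print(AIC(fit_int))

# Difference in scaled deviances and in residual degrees of freedom
dev_diff <- deviance(fit_age) - deviance(fit_int)
df_diff  <- df.residual(fit_age) - df.residual(fit_int)

# Upper-tail chi-squared probability for the deviance difference
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
print(c(dev_diff = dev_diff, df_diff = df_diff, p_value = p_value))

# The same comparison as an analysis of deviance table
comparison <- anova(fit_age, fit_int, test = "Chisq")
print(comparison)
```

A small p-value would suggest that adding gender and the interaction significantly improves the fit over the age-only model.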