Questions with R code in data science

1. Read and provide a one-page summary of the lme4 documentation.
2. Write down the likelihood function for 25 observations from an iid normal
model (you can write the likelihood for a single observation as
dnorm(y; µ; σ)).
(a) How would you use restricted maximum likelihood (REML) to treat
µ as a nuisance parameter?
(b) Write down the likelihood function for the following random effects
model with 25 observations and 5 random effect levels:
$y_{IJ} = \beta_0 + \beta_1 \alpha_J + \epsilon_I$
where $\epsilon_I \sim N(0;\ \sigma_y^2)$ and $\alpha_J \sim N(0;\ \sigma_A^2)$.
(c) How would you use REML to fit the above model?
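As a rough illustration of the dnorm notation used in question 2, the sketch below evaluates the iid normal log-likelihood in R. The data are simulated purely for illustration (the seed, mean and standard deviation are assumptions, not part of the assignment), and `loglik_iid` is a hypothetical helper name. For this simple model the REML estimate of σ is the usual sample standard deviation with divisor n - 1, while the ML estimate uses divisor n.

```r
# Hypothetical data: 25 simulated observations (values are assumptions).
set.seed(1)
y <- rnorm(25, mean = 10, sd = 2)

# Log-likelihood of an iid normal sample: the sum of the log densities,
# i.e. the log of the product of dnorm(y_i; mu; sigma) over all i.
loglik_iid <- function(mu, sigma, y) {
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

n <- length(y)
mu_hat <- mean(y)

# ML estimate of sigma: divisor n.
sigma_ml <- sqrt(sum((y - mu_hat)^2) / n)

# REML estimate of sigma: divisor n - 1, which accounts for the degree
# of freedom used up by estimating the nuisance parameter mu.
sigma_reml <- sqrt(sum((y - mu_hat)^2) / (n - 1))

c(ml = sigma_ml, reml = sigma_reml)
loglik_iid(mu_hat, sigma_ml, y)
```

The same `loglik_iid` function could be passed to `optim()` to check the closed-form ML estimates numerically.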
3. The data frame Gun (library nlme) is from a trial examining methods for
firing naval guns. Two firing methods were compared, each used by a number
of teams of 3 gunners; the gunners in each team were matched to have
similar physique (Slight, Average, Heavy). The response variable rounds
is rounds fired per minute, and there are 3 explanatory factor variables:
Physique (levels Slight, Average and Heavy), Method (levels M1 and M2)
and Team with 9 levels. The main interest is in determining which method
and/or physique results in the highest firing rate and in quantifying
team-to-team variability in firing rate.
(a) Identify which factors should be treated as random and which as fixed,
in the analysis of these data.
(b) Write out a suitable mixed model as a starting point for the analysis of
these data.
(c) Analyse the data using lme in order to answer the main questions of
interest and report your conclusions.
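One possible starting point for question 3, sketched under the assumption that Team is given a random intercept and that Method and Physique enter as additive fixed effects. This is only one candidate model; interactions or other structures may be worth considering, and `fit` is an arbitrary object name.

```r
library(nlme)  # provides lme() and the Gun data frame

# Inspect the variables: rounds, Method, Team, Physique.
str(Gun)

# Assumed starting model: fixed effects for Method and Physique,
# random intercept for each Team.
fit <- lme(rounds ~ Method + Physique, random = ~ 1 | Team, data = Gun)

summary(fit)   # fixed-effect estimates and standard errors
anova(fit)     # tests for the fixed-effect terms
VarCorr(fit)   # between-team and residual variance components
```

The `VarCorr()` output is the natural place to read off team-to-team variability, while `anova()` and `summary()` speak to the Method and Physique comparisons.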
4. The Carseats dataset from the R package ISLR is a simulated dataset of
car seat sales at 400 different stores. Full information on the variables in
this dataset can be found using help(Carseats) after loading the package.
(a) Create a new factor variable in the Carseats data representing whether
or not Sales is greater than 8. Randomly split the dataset into a training
set and a testing set. On the training set, grow a classification tree using
the R package rpart to classify whether a store had high car seat sales or
not (Hint: Remove the Sales variable). Report the classification accuracy
you got on the testing set and on the training set.
(b) Prune the tree you grew in part (a) and report the pruned tree's
classification accuracy on the testing set and on the training set. Why
might pruning have improved the classification accuracy on the testing
set? Why might it have reduced accuracy on the training set?
(c) Grow a random forest using the randomForest package the same way
you did the tree. Is its performance on the testing set better than that of
the classification trees? Why might that be the case?
(d) Briefly outline the similarities and differences between CARTs and random forests.
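A minimal sketch of the workflow in question 4, assuming the ISLR, rpart and randomForest packages are installed. The seed, the 50/50 split, the variable name `High`, the helper names `acc_tree` and `acc_rf`, and the rule of pruning at the cp value with the lowest cross-validated error are all illustrative choices, not requirements of the question.

```r
library(ISLR)          # Carseats data
library(rpart)         # classification trees
library(randomForest)  # random forests

set.seed(42)  # arbitrary seed so the random split is reproducible

# Binary response: is Sales greater than 8? Then drop Sales itself.
carseats <- Carseats
carseats$High <- factor(ifelse(carseats$Sales > 8, "Yes", "No"))
carseats$Sales <- NULL

# Random 50/50 split into training and testing sets (assumed proportion).
train_idx <- sample(nrow(carseats), size = nrow(carseats) / 2)
train <- carseats[train_idx, ]
test  <- carseats[-train_idx, ]

# Proportion of correctly classified stores.
acc_tree <- function(m, d) mean(predict(m, d, type = "class") == d$High)
acc_rf   <- function(m, d) mean(predict(m, d) == d$High)

# (a) Classification tree on the training set.
tree <- rpart(High ~ ., data = train, method = "class")

# (b) Prune at the cp value with the lowest cross-validated error.
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best_cp)

# (c) Random forest with default settings.
forest <- randomForest(High ~ ., data = train)

# Training and testing accuracy for each model.
rbind(
  tree   = c(train = acc_tree(tree, train),   test = acc_tree(tree, test)),
  pruned = c(train = acc_tree(pruned, train), test = acc_tree(pruned, test)),
  forest = c(train = acc_rf(forest, train),   test = acc_rf(forest, test))
)
```

The accuracies depend on the random split, so repeating the exercise with several seeds gives a better sense of how the three models compare.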
5. For each of the following data scenarios, which of the three spatial approaches (geostatistics, lattice data, point patterns) is appropriate, and why?
(a) The locations of petty crimes that occurred in the past week are plotted
on a street map of Chicago.
(b) Trees in an orchard are examined and their disease status (infected/not
infected) is recorded. We are interested in the spatial characteristics of
the disease, such as contagion between neighbouring trees.

(c) Earthquake aftershocks in Japan are detected and the epicenter latitude and longitude and the time of occurrence are recorded.