The goal of this project is to build a model that predicts the manner in which an exercise was performed, using data from accelerometers on the belt, forearm, arm, and dumbbell of six young, healthy male participants. While wearing the sensors, they were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
For more details, see the original paper, "Qualitative Activity Recognition of Weight Lifting Exercises", from which the data were taken.
if(!file.exists("training.csv")) {
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "training.csv")}
library(dplyr); library(caret); library(rattle); library(xtable); library(ggplot2); library(reshape2)
library(gridExtra); library(randomForest); library(sessioninfo)
set.seed(2410)
The original training dataset consists of 19622 measurements on 160 variables. Among those variables, 67 contain mostly missing values (NAs) and 33 consist mostly of empty character strings; both groups were removed.
The measurement index and the date are likewise irrelevant for the prediction task, as is the new_window column, which shows very little variance; all three were dropped. Finally, the character variables, among them the outcome of interest (the "classe" column), were converted to factors.
training<-read.csv("training.csv")
nacount<-apply(training,2, function(x) sum(is.na(x)))
table(nacount)
## nacount
## 0 19216
## 93 67
training<-training[,-which(nacount==19216)] # drop the 67 mostly-NA columns
emptycount<-apply(training,2, function(x) sum(x==""))
table(emptycount)
## emptycount
## 0 19216
## 60 33
training<-training[,-which(emptycount==19216)] # drop the 33 mostly-empty columns
training<-training[,-c(1,5)] # drop the row index (X) and the date (cvtd_timestamp)
nearZeroVar(training)
## [1] 4
table(training$new_window)
##
## no yes
## 19216 406
training<-training[,-4] # drop new_window, flagged by nearZeroVar above
char<-sapply(training, is.character) # TRUE for the remaining character columns
apply(training[,which(char)],2,unique)
## $user_name
## [1] "carlitos" "pedro" "adelmo" "charles" "eurico" "jeremy"
##
## $classe
## [1] "A" "B" "C" "D" "E"
training<-mutate(training, user_name=as.factor(user_name), classe=as.factor(classe))
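As a quick sanity check (a sketch, not part of the original pipeline), the dataset should now retain all rows and 57 columns (56 candidate predictors plus "classe"):
dim(training) # expected: 19622 57, i.e. 160 - 67 (NA) - 33 (empty) - 2 (index, date) - 1 (new_window)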
At this point, most of the remaining variables are sensor outputs and therefore good candidate predictors. Only "num_window", "raw_timestamp_part_1" and "raw_timestamp_part_2" lack a clear physical relationship with the outcome. After plotting each against "classe", I decided to remove the two timestamp variables.
par(mfrow=c(1,3))
boxplot(training$num_window~training$classe, xlab="class", ylab=NULL, main="num_window")
boxplot(training$raw_timestamp_part_1/1000~training$classe,
ylim=range(training$raw_timestamp_part_1/1000),
xlab="class", ylab=NULL, main="raw_timestamp_part_1/1000")
boxplot(training$raw_timestamp_part_2~training$classe,
xlab="class", ylab=NULL, main="raw_timestamp_part_2")
Figure 1
training<-training[,-c(2,3)] # drop the two raw timestamp columns
Concerning the "num_window" variable: although not evident from the plot, during model selection it turned out to be a perfect predictor when using the Random Forest method; on its own it classifies every data point in the training dataset with accuracy 1. The reason is that, within a given window, only one type of exercise was performed, as the tibble produced by the code below confirms. This variable therefore has to be removed to obtain a meaningful model, one that tells us something about the relationship between the sensor measurements and the correctness of the exercise.
group_by(training, num_window) %>% summarize(classes=unique(classe))
## # A tibble: 858 x 2
## num_window classes
## * <int> <fct>
## 1 1 E
## 2 2 E
## 3 3 E
## 4 4 E
## 5 5 E
## 6 6 E
## 7 7 E
## 8 8 E
## 9 9 E
## 10 10 E
## # ... with 848 more rows
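The tibble only shows the first 10 windows; a programmatic check (a small sketch complementing the output above) confirms that every window contains exactly one class:
classes_per_window<-tapply(training$classe, training$num_window, function(x) length(unique(x)))
all(classes_per_window==1) # TRUE: each window maps to a single class, so num_window leaks the outcome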
training<-training[,-2] # drop num_window
To limit computing time in subsequent steps, I looked for further variables to exclude by examining, for each variable and for every possible pair of classes, the difference in mean values, using t.test and recording the corresponding p-values. Only 3 variables show no significant difference in means at \(\alpha\)=0.05 for any pair of classes, and these were removed from the list of predictors.
couples<-combn(levels(training$classe),2)
pvals<-matrix(NA,ncol(training[,-54]),ncol(couples))
row.names(pvals)<-names(training[,-54])
colnames(pvals)<-apply(couples,2, function(x) paste(x[1],x[2]))
for (i in 1:nrow(pvals)) {
  for (j in 1:ncol(pvals)) {
    df<-filter(training, classe==couples[1,j]|classe==couples[2,j])[,c(i,54)]
    pvals[i,j]<-round(t.test(as.numeric(df[,1]) ~ df$classe)$p.value,3)
  }
}
minp<-apply(pvals,1,min)
which(minp>=.05)
## gyros_arm_x gyros_arm_z gyros_forearm_y
## 19 21 46
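For reference, the screening loop above can be condensed into a single sapply call; this sketch is an equivalent formulation (it assumes the dataset as it stands here, with "classe" still in column 54, and drops unused factor levels explicitly):
minp_alt<-sapply(names(training)[-54], function(v) {
min(apply(couples, 2, function(cl) {
d<-training[training$classe %in% cl,]
t.test(as.numeric(d[[v]]) ~ droplevels(d$classe))$p.value
}))
})
names(which(minp_alt>=.05)) # should match the three variables above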
training<-select(training, -c(19,21,46)) # drop the 3 non-significant variables
As a starting point for model selection, a decision tree was fitted with method "rpart". We can see that it uses only 4 predictors, ignoring the others, and performs rather poorly. In particular, it completely fails to predict the "D" class.
fit_rpart<-train(classe~., data=training, method="rpart")
fancyRpartPlot(fit_rpart$finalModel)
Figure 2
fit_rpart$finalModel$frame$var[fit_rpart$finalModel$frame$var %in% names(training)]
## [1] "roll_belt" "pitch_forearm" "magnet_dumbbell_y"
## [4] "roll_forearm"
confusionMatrix(training$classe, predict(fit_rpart))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.4955662 0.3406638 0.4885461 0.5025876 0.5208949
## AccuracyPValue McnemarPValue
## 1.0000000 NaN
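The failure on class "D" can be seen in the full cross-tabulation (a sketch; note that the truth is passed as the first argument here, so caret's "Prediction"/"Reference" labels are swapped):
confusionMatrix(training$classe, predict(fit_rpart))$table # no observation is ever assigned to class "D"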
To improve accuracy, a more sophisticated method is required, along with a greater number of predictors. My strategy was to fit a Random Forest model using, once again, the entire training dataset and all 50 predictors, with the aim of ranking them by relative importance (measured by the mean decrease in the Gini coefficient).
As the confusion matrix shows, this model performs perfectly on the training set, but it is far too complex, and at this stage not worth testing via a cross-validation technique.
fit_rf<-train(classe~., data=training, method="rf")
confusionMatrix(training$classe, predict(fit_rf))$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 1.0000000 1.0000000 0.9998120 1.0000000 0.2843747
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
imp<-as.data.frame(fit_rf$finalModel$importance)
imp<-arrange(imp, -MeanDecreaseGini)
varImpPlot(fit_rf$finalModel, main="Variable importance")
Figure 3
The following step was to fit several Random Forest models with an increasing number of predictors, added one at a time according to the importance ranking established above. As no testing dataset with known outcomes is provided, out-of-sample accuracy was estimated with a k-fold approach (5 folds for each set of predictors tested).
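For comparison, caret can produce a similar k-fold estimate internally via trainControl; the manual loop below was preferred here so that fold-level results could be printed and the run interrupted. A sketch of the built-in alternative (not what was actually run):
# ctrl<-trainControl(method="cv", number=5) # 5-fold CV, mirroring the manual split
# fit_cv<-train(classe~., data=training, method="rf", trControl=ctrl)
# fit_cv$resample # would hold the per-fold accuracy estimates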
s<-sample(1:nrow(training))
training_resampled<-training[s,]
l<-floor(nrow(training)/5)
mean_insample<-numeric()
mean_outsample<-numeric()
for(i in 1:50) { # As the peak in out-of-sample accuracy had already been reached, the loop was interrupted at i=12 (after 13 hrs of computation)
  insample<-numeric(); outsample<-numeric() # reset before, not inside, the fold loop, otherwise only the last fold survives
  for (k in floor(seq(1,nrow(training), len=6))[-6]) {
    subtest<-select(training_resampled[k:(k+l),],c(row.names(imp)[1:i], classe))
    subtrain<-select(training_resampled[-(k:(k+l)),],c(row.names(imp)[1:i], classe))
    fit<-train(classe ~., data=subtrain, method="rf")
    print(paste(Sys.time(),"i=",i,"k=",k))
    insample<-c(insample,confusionMatrix(subtrain$classe, predict(fit))$overall[1])
    outsample<-c(outsample, confusionMatrix(subtest$classe, predict(fit, subtest))$overall[1])
    writeLines("insample"); print(insample)
    writeLines("outsample"); print(outsample) # some information, not shown here, was printed at the end of each fold to monitor the computation
  }
  mean_insample[i]<-mean(insample); mean_outsample[i]<-mean(outsample)
  writeLines(paste("\ni=",i))
  writeLines(c("\nmean_insample\n", mean_insample))
  writeLines(c("\nmean_outsample\n", mean_outsample))
}
# Build a data.frame rather than cbind(): a matrix has no nrow() for a vector index
# or names()<- columns, and melt() below needs the id column to be kept by name
tab<-data.frame(Number_of_Predictors=seq_along(mean_insample),
Training_Subset=round(mean_insample,4),
Testing_Subset=round(mean_outsample,4))
xt<-xtable(tab, caption="Table 1: Mean accuracy by number of predictors", digits=4)
print(xt, type="html")
Number_of_Predictors | Training_Subset | Testing_Subset
---|---|---
1 | 0.5368 | 0.4968
2 | 0.9214 | 0.6724
3 | 0.9744 | 0.8711
4 | 0.9937 | 0.9429
5 | 0.9940 | 0.9682
6 | 0.9954 | 0.9755
7 | 0.9985 | 0.9865
8 | 0.9996 | 0.9880
9 | 1.0000 | 0.9908
10 | 1.0000 | 0.9893
11 | 1.0000 | 0.9890
12 | 1.0000 | 0.9890
mtab<-melt(tab, id.vars="Number_of_Predictors", variable.name="Predictions_on", value.name="Accuracy")
g1<-ggplot(mtab, aes(x=Number_of_Predictors, y=Accuracy, col=Predictions_on))+ geom_line()+ geom_point(pch=19, col=rep(1:12,2), cex=2)+ geom_hline(yintercept=1)+ scale_x_continuous(breaks=1:12)+ ylab("Accuracy") # a proportion, not a percentage
g2<-g1+coord_cartesian(ylim=c(0.97,1))+ geom_vline(xintercept=9, lty=2)+ ggtitle(NULL, "Zoom")
grid.arrange(g1, g2, ncol=2)
Figure 4
As shown, the best performance on data not used to train the model is about 99.1%, obtained with the 9 predictors having the greatest Gini index (although with only 4 predictors the accuracy would already be very good). Adding further predictors would yield only tiny changes in accuracy (see the rightmost plot) while increasing computing time and reducing interpretability.
For these reasons, our final model will be a Random Forest model fitted on the entire training dataset using the following predictors from the original dataset:
row.names(imp)[1:9]
## [1] "roll_belt" "pitch_forearm" "yaw_belt"
## [4] "pitch_belt" "roll_forearm" "magnet_dumbbell_y"
## [7] "magnet_dumbbell_z" "accel_dumbbell_y" "accel_forearm_x"
Here are the code and the main parameters of this model:
fitFinal<-train(classe ~., data=select(training,c(row.names(imp)[1:9], classe)), method="rf")
fitFinal$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.81%
## Confusion matrix:
## A B C D E class.error
## A 5559 10 9 1 1 0.003763441
## B 16 3732 44 3 2 0.017118778
## C 1 12 3400 9 0 0.006428989
## D 0 2 19 3192 3 0.007462687
## E 0 11 8 7 3581 0.007208206
Applying this model to the small test dataset provided for the quiz gave 100% correct answers. On a larger dataset, the most optimistic estimate of out-of-sample accuracy is, as shown above, about 99.1%, and the expected out-of-bag error rate is 0.81%.
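For completeness, this is how such quiz predictions can be obtained, assuming the companion testing file downloaded at the top of the report (a sketch; the quiz answers themselves are not shown):
testing<-read.csv("testing.csv")
predict(fitFinal, newdata=testing) # 20 predicted classes, one per quiz case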
The analysis was run with R 4.0.2 on an x86_64 Windows machine. Here are the versions of the most relevant packages used:
## [,1] [,2]
## [1,] "caret" "6.0-86"
## [2,] "dplyr" "1.0.4"
## [3,] "ggplot2" "3.3.3"
## [4,] "gridExtra" "2.3"
## [5,] "randomForest" "4.6-14"
## [6,] "rattle" "5.4.0"
## [7,] "reshape2" "1.4.4"
## [8,] "xtable" "1.8-4"
Word count (according to the RStudio feature) = 919