This course seems like a lot of fun.
You can find my preprocessed data set and the data wrangling R script in my GitHub repository.
The learning2014 dataset was collected in an international survey of approaches to learning (you can find more information on it here) between 3 December 2014 and 10 January 2015. It contains 166 observations on 7 variables.
learning2014 <- read.table("learning2014.csv", header = TRUE, sep = "\t")
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The variables are gender, age, attitude, deep, stra, surf and points.
Variable | Definition | Value |
---|---|---|
age | age (in years) derived from the date of birth | integer |
gender | gender | M (Male), F (Female) |
attitude | global attitude toward statistics | 1 to 5 on the Likert-scale |
deep | average score of questions related to deep learning | 1 to 5 |
stra | average score of questions related to strategic learning | 1 to 5 |
surf | average score of questions related to surface learning | 1 to 5 |
points | exam points | integer |
It seems like a fairly plausible assumption that a student's attitude towards statistics has a big impact on exam success. We can start by plotting the variables attitude and points as a two-way scatter plot with a colour indicator for gender and regression lines. (I use the option include=FALSE when loading the libraries to keep R's warning messages out of the output.)
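For reference, here is a sketch of the library calls assumed by the code in this report (in the actual .Rmd they are loaded in chunks with include=FALSE):

# packages assumed to be loaded (with include=FALSE) for the analyses below
library(ggplot2)   # plots, ggplot()
library(GGally)    # ggpairs()
library(dplyr)     # glimpse(), mutate(), select(), group_by(), %>%
library(boot)      # cv.glm()
library(MASS)      # lda() and the Boston data
library(corrplot)  # corrplot()
library(plotly)    # plot_ly()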
# initialize plot with data and aesthetic mapping, define the visualization type (points), add a regression line and add a main title:
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender)) + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's attitude versus exam points")
# draw the plot
p1
Next let's consider a more detailed plot describing the variables, their distributions and relationships. This plot contains a heapload of information! First, we can in fact see that the hypothesis on the relation between attitude and exam points was plausible, as the correlation between points and attitude is indeed stronger than the correlation between points and any other variable.
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
From the summary of the data we can see the min, max, median and mean values as well as the 1st and 3rd quartiles of all the variables. One can note that the sample is biased towards female respondents, that the respondents are mostly in their early twenties and that they have a generally more positive than negative view of statistics as a discipline. The median and mean values of the learning variables deep, stra and surf show that the respondents are more likely to engage in deep learning than surface learning.
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Next let's see if some of the hypotheses made above hold up against a bit more sophisticated analysis. We will consider exam points as the dependent variable, and the aim is to fit a regression model that explains what affects the number of exam points.
From the plots above, we can deduce that attitude, stra and surf have the highest correlations with points in absolute terms. Let's thus choose these as our explanatory variables and fit the following linear model with multiple explanatory variables:
# create a regression model with multiple explanatory variables
points_mmodel <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(points_mmodel)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
From the summary of the model we can see the results. Under "Call" we get a recap of the model. Under "Residuals" the summary prints out important information on the residuals. Recall that the residuals of a linear regression model are assumed to be normally distributed with a zero mean and a constant variance. We can see that the median is quite close to zero; note that the mean of the residuals is exactly zero by construction when an intercept is included, so further checks should focus on the shape of the residual distribution.
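As a sketch of such a check, one could look at the mean of the residuals and run a Shapiro-Wilk normality test on them:

# sketch of further residual checks: the mean of the residuals (zero by construction
# when an intercept is included) and a Shapiro-Wilk test of normality
mean(residuals(points_mmodel))
shapiro.test(residuals(points_mmodel))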
Under "Coefficients" we have the estimated effects of the explanatory variables on the dependent variable, their standard errors, t-values and p-values. The last two are perhaps the most important, as they indicate whether the estimated effects are statistically significant. From the printout we can immediately see that only the intercept and the effect of attitude on the dependent variable are statistically significantly different from zero. The effects of stra and surf are not, and these variables should be removed from the model.
Let’s then fit the model again without the two statistically insignificant variables, that is, as a simple linear regression:
# create a regression model with a single explanatory variable
points_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(points_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now we can see that the median of the residuals is closer to zero than in the multiple regression, which is a good thing. We can also see that both the intercept and the effect of attitude are now highly statistically significant, whereas before, when the other two variables were included, the intercept was slightly less significant.
From the summary of our simple linear regression model we can see that the estimated effect of attitude on points is 3.5255. This means that a one-unit increase in attitude corresponds to a 3.5255-point increase in exam points. Even more concretely, a student with an attitude score of 4 is expected to score 3.5255 points more on the exam than a student with an attitude score of 3. This result clearly supports the hypothesis made in part 2 that attitude towards statistics is an important predictor of success in the exam.
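To make this concrete, one could (as a small sketch) use predict() to compare the fitted exam points of two hypothetical students with attitude scores of 3 and 4:

# sketch: predicted exam points for hypothetical attitude scores of 3 and 4
new_students <- data.frame(attitude = c(3, 4))
predict(points_model, newdata = new_students)
# the difference between the two predictions equals the estimated slope, about 3.53 points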
The "multiple R-squared" printed as part of the summary gives the coefficient of determination, i.e. the proportion of the variance of the dependent variable that is predicted by the model. Because our model includes an intercept, R-squared equals the square of the sample correlation between the observed outcomes and the fitted values. The value of R-squared in our model is 0.1906, which means that 19.06 % of the variance of the dependent variable is explained by our model. The remaining variance is explained by other factors that are not included in our model.
Notice that R-squared is higher in the multiple regression model, even though the additional explanatory variables are insignificant. This is because R-squared automatically increases when any explanatory variable is added to a model. To account for this, R also prints an "adjusted R-squared", which adjusts for the number of explanatory variables relative to the number of data points.
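As a quick sanity check (a sketch using the figures printed above), the adjusted R-squared of the multiple regression can be reproduced from the multiple R-squared, the number of observations and the number of explanatory variables:

# sketch: adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n <- 166  # number of observations
k <- 3    # number of explanatory variables (attitude, stra, surf)
1 - (1 - 0.2074) * (n - 1) / (n - k - 1)  # approximately 0.1927, as in the summary above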
When using a linear regression model, we make two important assumptions on the data generating process:

- the target variable is a linear combination of the model parameters
- the errors are normally distributed with zero mean and a constant variance
The second assumption implies also that the errors should not be correlated and the size of a given error should not depend on the explanatory variable.
We can study the validity of the model by plotting the following three diagnostic plots: Residuals vs fitted values, Normal QQ-plot and Residuals vs. leverage.
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(points_model, which = c(1,2,5))
The assumption of constant error variance can be assessed with a simple scatter plot of residuals vs. model predictions (fitted values). Any pattern in this plot implies a problem. Our plot appears to show no pattern, so we can accept the constant variance assumption with reasonable certainty.
The normal QQ plot allows us to study the normality of the errors. The plotted values should fall reasonably close to the straight line, and we can see from the plot that this is indeed the case.
From the residuals vs. leverage plot we can study the impact that single observations have on the model. From this plot we can see that no single observation appears to stand out.
Thus with the help of these diagnostic plots we can conclude that the assumptions of the linear regression model are valid.
You can find my preprocessed data set and the data wrangling R script in my GitHub repository.
The alc data set contains data on student alcohol consumption and it is a joined data set from two data sets included in the Student Performance Data. You can learn more about the data here.
The data contains the combined results from two surveys conducted on students in mathematics and Portuguese classes. The data were merged in the data wrangling part of this exercise, and the variables not used to join the data sets were combined by averaging. Because we are interested in studying the alcohol use of students, we have defined two new variables, alc_use and high_use (see the sketch below).
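As a rough sketch of how these two variables were created in the data wrangling script (assuming the standard definitions used on this course):

# sketch of the wrangling step (assumed definitions): average the weekday (Dalc) and
# weekend (Walc) alcohol consumption, and flag averages above 2 as high use
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
alc <- mutate(alc, high_use = alc_use > 2)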
The data contains 382 observations on 35 variables. Some of the values are numerical, some dichotomous, but there are also variables with different structures.
alc <- read.table("alc.csv", header = TRUE, sep = ";")
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
Now we will present a few hypotheses about the possible links between some variables in the data and alcohol consumption. First, as in the DataCamp exercises, it seems plausible to assume that sex, absences and failures are correlated with alcohol consumption. As a fourth candidate we will choose going out (goout), the idea being that students who go out partying a lot also end up consuming more alcohol.
Let's state the hypotheses formally:

- H1: male students consume more alcohol than female students
- H2: higher alcohol consumption is associated with more absences
- H3: higher alcohol consumption is associated with more class failures
- H4: higher alcohol consumption is associated with going out more

All four hypotheses are thus associated with a positive correlation.
Now we want to get some idea of how the chosen variables are distributed and what their relationship with alcohol consumption could be.
Let's start by taking a look at the distributions of the chosen variables and the two variables of interest. From the bar plots it is easy to see that in our sample alcohol consumption is strongly skewed towards low consumption, and that the number of students who use a lot of alcohol is less than half of the number who use alcohol moderately. From the plots we can also see that there are slightly more female than male respondents, that absences are strongly skewed towards little or no absences with a few outliers, that most students pass their courses, and that going out is the only variable whose distribution even remotely resembles a normal distribution.
ggplot(data = alc, aes(x = alc_use)) + geom_bar()
ggplot(data = alc, aes(x = high_use)) + geom_bar()
ggplot(data = alc, aes(x = sex)) + geom_bar()
ggplot(data = alc, aes(x = absences)) + geom_bar()
ggplot(data = alc, aes(x = failures)) + geom_bar()
ggplot(data = alc, aes(x = goout)) + geom_bar()
We can also easily draw a plot that emphasizes the differences in alcohol consumption between male and female students. The plot offers support for our first hypothesis.
g2 <- ggplot(data = alc, aes(x = high_use))
g2 + geom_bar() + facet_wrap("sex")
First, we ignore gender and cross-tabulate alcohol consumption against the means of the other three variables. We can see that the means of the other three variables appear to show an upward-sloping trend as alcohol use increases. This is indeed what we would expect on the basis of our hypotheses.
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 9 x 5
## alc_use count mean_absences mean_failures mean_goout
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.0 140 3.357143 0.1071429 2.750000
## 2 1.5 69 4.231884 0.1449275 2.869565
## 3 2.0 59 3.915254 0.2203390 3.084746
## 4 2.5 44 6.431818 0.1363636 3.295455
## 5 3.0 32 6.093750 0.5625000 4.000000
## 6 3.5 17 5.647059 0.4705882 3.764706
## 7 4.0 9 6.000000 0.2222222 4.222222
## 8 4.5 3 12.000000 0.0000000 3.666667
## 9 5.0 9 6.888889 0.5555556 4.222222
If we include sex in the grouping variables, the trends in the means become somewhat more pronounced.
alc %>% group_by(alc_use, sex) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 17 x 6
## # Groups: alc_use [?]
## alc_use sex count mean_absences mean_failures mean_goout
## <dbl> <fctr> <int> <dbl> <dbl> <dbl>
## 1 1.0 F 87 3.781609 0.10344828 2.896552
## 2 1.0 M 53 2.660377 0.11320755 2.509434
## 3 1.5 F 42 4.642857 0.04761905 2.857143
## 4 1.5 M 27 3.592593 0.29629630 2.888889
## 5 2.0 F 27 5.000000 0.25925926 3.296296
## 6 2.0 M 32 3.000000 0.18750000 2.906250
## 7 2.5 F 26 6.576923 0.19230769 2.961538
## 8 2.5 M 18 6.222222 0.05555556 3.777778
## 9 3.0 F 11 7.636364 0.54545455 4.090909
## 10 3.0 M 21 5.285714 0.57142857 3.952381
## 11 3.5 F 3 8.000000 0.33333333 3.333333
## 12 3.5 M 14 5.142857 0.50000000 3.857143
## 13 4.0 F 1 3.000000 0.00000000 4.000000
## 14 4.0 M 8 6.375000 0.25000000 4.250000
## 15 4.5 M 3 12.000000 0.00000000 3.666667
## 16 5.0 F 1 3.000000 0.00000000 5.000000
## 17 5.0 M 8 7.375000 0.62500000 4.125000
We can also do the cross-tabulations with high_use as the variable of interest. Here the differences between the means for students who are heavy users of alcohol and those who are not are quite evident.
alc %>% group_by(high_use, sex) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 4 x 6
## # Groups: high_use [?]
## high_use sex count mean_absences mean_failures mean_goout
## <lgl> <fctr> <int> <dbl> <dbl> <dbl>
## 1 FALSE F 156 4.224359 0.1153846 2.955128
## 2 FALSE M 112 2.982143 0.1785714 2.714286
## 3 TRUE F 42 6.785714 0.2857143 3.357143
## 4 TRUE M 72 6.125000 0.3750000 3.930556
We will next draw box plots with high_use as the grouping variable, sex as a colour and the other three variables on the y-axis. From the box plots we can see that hypotheses 1, 2 and 4 seem plausible: high alcohol consumption appears to be associated with the respondent being male, having relatively more absences and going out more. However, the box plot for failures is a bit perplexing. Perhaps because the variable is so heavily skewed towards zero, the plot appears to suggest that there are only some outliers different from zero.
g1 <- ggplot(alc, aes(x = high_use, col = sex, y = absences))
g1 + geom_boxplot() + ylab("absences")
g2 <- ggplot(alc, aes(x = high_use, col = sex, y = failures))
g2 + geom_boxplot() + ylab("failures")
g3 <- ggplot(alc, aes(x = high_use, col = sex, y = goout))
g3 + geom_boxplot() + ylab("goout")
In this part we will use logistic regression to gain more solid knowledge on the effect our chosen variables may have on the target variable. Here the target variable is the binary high/low alcohol consumption.
# find the model with glm()
m1 <- glm(high_use ~ sex + absences + failures + goout, data = alc, family = "binomial")
From the summary of the model we can immediately see that failures is indeed not statistically significant, which is in line with the box plot. Other than that, the results of the regression are in line with our hypotheses. Male students are more likely to be heavy users of alcohol, as are those with more absences and those who go out more.
# print out a summary of the model
summary(m1)
##
## Call:
## glm(formula = high_use ~ sex + absences + failures + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0299 -0.7890 -0.5417 0.7857 2.3588
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.13390 0.47292 -8.741 < 2e-16 ***
## sexM 0.91999 0.25625 3.590 0.000330 ***
## absences 0.08243 0.02235 3.688 0.000226 ***
## failures 0.34560 0.21093 1.638 0.101330
## goout 0.70797 0.11998 5.901 3.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 385.06 on 377 degrees of freedom
## AIC: 395.06
##
## Number of Fisher Scoring iterations: 4
Next we will present and interpret the coefficients of the model as odds ratios and calculate their confidence intervals.
Recall that the odds ratio is defined as the ratio of the odds of Y=TRUE for individuals with property X to the odds of Y=TRUE for individuals without property X. Here Y would be high consumption of alcohol and X e.g. the respondent being male. In the ratios we have computed, the odds ratio for the variable sexM thus gives the ratio of the odds of a male being a heavy drinker to the odds of a female being a heavy drinker. We can see that the ratio is more than 2.5, i.e. the odds of being a heavy drinker are more than 2.5 times higher for men than for women.
As we are working with a logistic regression, we can interpret the odds ratios for non-binary variables as odds ratios between a unit change vs. no change in the corresponding explanatory variable. Thus the odds ratio of “goout”, 2.0, can be interpreted as the ratio between the odds of someone who goes to one more party being a heavy drinker and the odds of someone who doesn’t go to that party being a heavy drinker.
Note also that as all of the odds ratios calculated are higher than one, all our chosen variables are positively associated with high alcohol consumption.
From the confidence intervals we can see that none of the intervals of the statistically significant variables includes one, but the interval for failures does. Since an odds ratio of one corresponds to no effect, the effect of failures on the odds of high alcohol consumption is not statistically different from zero.
# compute odds ratios (OR)
OR1 <- coef(m1) %>% exp
# compute confidence intervals (CI)
CI1 <- confint(m1) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR1, CI1)
## OR1 2.5 % 97.5 %
## (Intercept) 0.01602022 0.006086805 0.03902357
## sexM 2.50925591 1.526894419 4.17886012
## absences 1.08592493 1.040580166 1.13727912
## failures 1.41283378 0.934759890 2.14797583
## goout 2.02986872 1.613471240 2.58531654
In this part, we will explore the predictive power of a model that includes only the variables with a statistically significant relationship with high alcohol consumption. Thus we will drop failures and fit a new model:
m2 <- glm(high_use ~ sex + absences + goout, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ sex + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
Now we will use the predict() function to make predictions from this model. We start by adding two columns to our data set: “probability” with the predicted probabilities and “prediction” which takes value TRUE if the value of “probability” is larger than 0.5.
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, sex, goout, high_use, probability, prediction) %>% tail(10)
## absences sex goout high_use probability prediction
## 373 0 M 2 FALSE 0.14869987 FALSE
## 374 7 M 3 TRUE 0.39514446 FALSE
## 375 1 F 3 FALSE 0.13129452 FALSE
## 376 6 F 3 FALSE 0.18714923 FALSE
## 377 2 F 2 FALSE 0.07342805 FALSE
## 378 2 F 4 FALSE 0.25434555 FALSE
## 379 2 F 2 FALSE 0.07342805 FALSE
## 380 3 F 1 FALSE 0.03989428 FALSE
## 381 4 M 5 TRUE 0.68596604 TRUE
## 382 2 M 1 TRUE 0.09060457 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 65 49
# tabulate the target variable versus the predictions with margins:
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
From the 2x2 cross tabulation of predictions versus the actual values we can see that our model is not perfect: in 15 cases it predicted a respondent to be a heavy user when in fact they were not, and vice versa in 65 cases. On the other hand, in 302 cases the prediction is correct. From the second table we can see the shares of correct and incorrect predictions: in about 21 % of the cases the prediction was incorrect, which seems like a reasonable fit.
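The shares of correct and incorrect predictions can also be computed directly from the cross tabulation above (a quick sketch):

# sketch: overall prediction accuracy and error computed from the confusion matrix above
(253 + 49) / 382  # share of correct predictions, about 0.79
(65 + 15) / 382   # share of incorrect predictions, about 0.21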
We can also see this in a graph that plots the true values against the predictions:
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
We still want to compute the total proportion of inaccurately classified individuals, i.e. the training error. For this purpose we will define a loss function and then calculate the average number of wrong predictions in the training data.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7015707
loss_func(class = alc$high_use, alc$probability)
## [1] 0.2094241
That is, the training error is approximately 0.21, or 21 %. This seems like a reasonable fit.
Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a loss function is computed on data not used for finding the model. Cross-validation gives a good sense of the actual predictive power of the model.
# 10-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2172775
The average number of wrong predictions in the cross-validation is about 0.22, which is smaller than the 0.26 error of the model in the DataCamp exercises. Thus my model has slightly better test set performance than the example model.
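As an additional comparison (a sketch, not run above), one could cross-validate the original four-variable model m1 in the same way to see how much dropping failures changes the out-of-sample error:

# sketch: 10-fold cross-validation of the four-variable model m1 for comparison
cv_m1 <- cv.glm(data = alc, cost = loss_func, glmfit = m1, K = 10)
cv_m1$delta[1]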
The Boston dataset contains data from a study on housing values in the suburbs of Boston. With the data one can, for example, study the effect of crime rates or air quality on the value of owner-occupied homes. You can learn more about the data here.
The dataset contains 506 observations on 14 variables. The observations are numeric: most variables take real (decimal) values, some take integer values and one (chas) is a dichotomous dummy variable.
# load the data
data("Boston")
# glimpse at the dataset
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
Next we will look at the data in a bit more detail. Start by printing out the column/variable names and writing down the definitions of the different variables (from here):
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Variable | Definition |
---|---|
crim | per capita crime rate by town |
zn | proportion of residential land zoned for lots over 25,000 sq.ft |
indus | proportion of non-retail business acres per town |
chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
nox | nitrogen oxides concentration (parts per 10 million) |
rm | average number of rooms per dwelling |
age | proportion of owner-occupied units built prior to 1940 |
dis | weighted mean of distances to five Boston employment centres |
rad | index of accessibility to radial highways |
tax | full-value property-tax rate per $10,000 |
ptratio | pupil-teacher ratio by town |
black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
lstat | lower status of the population (percent) |
medv | median value of owner-occupied homes in $1000s |
From the summary of the variables we can see minimum, maximum, median and mean values as well as the 1st and 3rd quartiles of the variables.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We can study and visualize the correlations between the different variables with the help of a correlation matrix and a correlation plot.
# First calculate the correlation matrix and round it so that it includes only two digits:
cor_matrix<-cor(Boston) %>% round(digits = 2)
# Print the correlation matrix:
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# Visualize the correlation matrix with a correlations plot:
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
From the plot above we can easily see which variables correlate with which, and whether the correlation is positive or negative. For example, we can see that the distance from the employment centres (dis) correlates negatively with the age of the houses (age), the nitrogen oxides concentration (nox) and the proportion of non-retail business acres (indus). We can also see that accessibility to radial highways (rad) is highly positively correlated with the property tax rate (tax), as are industrial land use (indus) and air pollution (nox). At first glance the correlations appear quite intuitive and nothing stands out as very surprising. One could note that the correlations of the only dummy variable, chas, with all the other variables are almost negligible. This could be because the overall share of observations with chas = 1 is very small.
We will later use linear discriminant analysis on our data, so we need variables that are normally distributed and share the same covariance matrix across the different classes. This is why the data should be scaled before fitting the model.
Thus we want to standardize the data set by scaling it. This is very easy to do in R using the scale() function, which subtracts the mean of each variable from its values and divides the result by the standard deviation of that variable. From the summary we can see that the standardized values are much more similar in magnitude across the different variables, so their relationships are easier to grasp. Note that the mean of each variable is now zero, as it should be. Also note that previously all values were positive, whereas now at least the minimum value of every variable is negative. This is because the mean of every variable is larger than its smallest value.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
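To see what scale() is doing under the hood, one could (as a sketch) standardize a single variable by hand and compare it to the corresponding column of the scaled data:

# sketch: manual standardization of crim, which should match the crim column of scale(Boston)
crim_std <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
head(crim_std)
head(boston_scaled[, "crim"])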
Note that the scaled data set is now in matrix format. We want to change the class of the scaled data set into a data frame, which we can do with the as.data.frame() function:
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
Next we want to create a categorical (or factor) variable of the crime rate in the scaled Boston data set. We will use the quantiles as the break points in the categorical variable. We will print the summary of the scaled crime rate, the quantile vector of this variable and a table of the new factor variable to check that everything goes as intended.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
In order to avoid confusion, we want to drop the old crime rate variable from the dataset and replace it with the new categorical variable for crime rates. This is easily done with the following two lines of code:
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Lastly in this step, we will divide the dataset into train and test sets. This will allow us further down the line to check how well our model works. We will assign 80 % of the data to the train set and the remaining 20 % to the test set. The model is trained on the train set and predictions on new data are made on the test set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Now we will fit a linear discriminant analysis on the train set. LDA is a classification method that finds the (linear) combinations of the variables that best separate the target variable classes. We will use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2673267 0.2326733 0.2574257 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 0.8949519 -0.9062703 -0.126510540 -0.8639591 0.45121626
## med_low -0.1002056 -0.3311926 0.062743293 -0.5786949 -0.13361340
## med_high -0.3874684 0.2113804 0.257665195 0.4054934 0.08666805
## high -0.4872402 1.0149946 0.008892378 1.0154871 -0.45346741
## age dis rad tax ptratio
## low -0.8661588 0.8290926 -0.6873107 -0.7344032 -0.4370892
## med_low -0.3660405 0.3835351 -0.5432545 -0.4832759 -0.1057476
## med_high 0.4814412 -0.4204791 -0.4341408 -0.3260902 -0.3605327
## high 0.7900634 -0.8473921 1.6596029 1.5294129 0.8057784
## black lstat medv
## low 0.37307325 -0.77176361 0.53243100
## med_low 0.31789570 -0.13216633 0.02523482
## med_high 0.07054389 0.04851603 0.17531453
## high -0.85678592 0.87892067 -0.68679007
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11064629 0.50851943 -1.0794251
## indus 0.05698601 -0.30744412 0.2072748
## chas -0.05340348 -0.04092194 0.2014044
## nox 0.13397336 -0.77053254 -1.2072462
## rm -0.14887690 -0.07977527 -0.1655379
## age 0.27211892 -0.47851907 -0.1870665
## dis -0.14504293 -0.22011357 0.3343980
## rad 3.68193482 0.90265651 -0.2031073
## tax 0.09624445 0.08772477 0.6807715
## ptratio 0.12104699 0.05061285 -0.3051478
## black -0.15168803 0.02189388 0.1130713
## lstat 0.18716883 -0.20425751 0.4713960
## medv 0.17434448 -0.33140258 -0.1022168
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9548 0.0348 0.0104
Using the trained model, the LDA calculates the probability of a new observation belonging to each of the classes and then classifies the observation to the most probable class.
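To illustrate (a sketch), the posterior class probabilities and the resulting classifications for the training data itself can be inspected with predict():

# sketch: posterior class probabilities and predicted classes for the training observations
lda.post <- predict(lda.fit)
head(lda.post$posterior)
head(lda.post$class)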
The most convenient and informative way to consider the results of an LDA is to visualize with a biplot. This we can do following the Datacamp exercise with the following code:
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The arrows in this plot represent the relationships of the original variables and the LDA solution.
In this part we want to see how the LDA model performs when predicting on new data. To this end, we can use the predict() function. We start by predicting the crime classes with the test data.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 6 0 0
## med_low 8 12 12 0
## med_high 3 4 13 2
## high 0 0 1 28
From the cross tabulation of the results we can see that the model appears to predict the correct classes reasonably well. It has the most trouble separating med_low from low, but most of the predictions fall in the correct class.
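The overall share of correctly classified test observations can also be computed directly (a quick sketch):

# sketch: proportion of test set observations classified into the correct crime category
mean(correct_classes == lda.pred$class)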
Next we will take a few steps towards clustering the data. We start by reloading the Boston dataset and standardizing it again; this is necessary because we have made other changes to the dataset along the way.
# load the data
data("Boston")
# center and standardize variables
boston_scaled2 <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled2)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled)
Next we wish to calculate the distances between the observations. We do this by forming a standard Euclidian distance matrix:
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
## Warning in dist(boston_scaled2): NAs introduced by coercion
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1394 3.5267 4.9081 4.9394 6.2421 13.0045
I ran into trouble here because there are NA values in the boston_scaled2 distance matrix: the k-means algorithm wouldn't run on this data set, returning an error on the NA values. The likely culprit is the conversion step above, where boston_scaled2 was accidentally created from boston_scaled (which contains the factor variable crime instead of the numeric crim), so dist() coerces the factor and introduces NAs. For consistency with the DataCamp exercises I'm doing the rest of this exercise with the original Boston dataset.
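A sketch of the intended assignment, assuming one wanted to continue with the scaled data:

# sketch (assumed fix): build boston_scaled2 from the rescaled numeric data, so that
# dist() only sees numeric columns and no NAs are introduced
boston_scaled2 <- as.data.frame(scale(Boston))
dist_eu_scaled <- dist(boston_scaled2)
summary(dist_eu_scaled)

With the unscaled Boston data, a new try: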
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
From the summary we can see the key summary statistics of the calculated distances between the observations.
Then let's move on to k-means clustering, which assigns observations to groups (clusters) based on their similarity. The following code runs the k-means algorithm on the dataset:
# k-means clustering
km <-kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
It's difficult to see anything from this, so let's zoom in a bit and look at only columns 6 to 10:
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
Next we want to know the optimal number of clusters. We can determine this by looking at how the total within-cluster sum of squares (TWCSS) behaves when the number of clusters changes. When the number of clusters is plotted against the TWCSS, the optimal number of clusters is where the TWCSS drops radically.
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the plot above it's evident that the optimal number of clusters is not three but two. Let's run the k-means algorithm again with two clusters and then visualize the results:
# k-means clustering
km <-kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
Start by running the code below for the (scaled) train data that we used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
Then we will use the package plotly to create a 3D plot of the columns of the matrix product:
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
This is very cool indeed! But adding different colors for the different crime rate classes definitely makes it even cooler:
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
Lastly, we are supposed to define the color by the clusters of the k-means. Here I must be doing something wrong, because the colors disappear completely: most likely color = km passes the entire kmeans object instead of the vector of cluster assignments, and km was in any case fitted on the full Boston data rather than on the training observations used in the projection.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km)
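A sketch of a possible fix, assuming the clusters should describe the same training observations as the LDA projection: run k-means on the train predictors and colour the points by the cluster assignments (km$cluster). The choice of four centers is an assumption, matching the four crime categories:

# sketch (assumed fix): k-means on the train predictors, points coloured by cluster
km_train <- kmeans(model_predictors, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = as.factor(km_train$cluster))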
You can find my preprocessed data set and the data wrangling R script in my GitHub repository.
In this week's assignment we are working with data from the Human Development Index (HDI) and the Gender Inequality Index (GII), both compiled by the United Nations Development Programme. You can learn more about the data here on HDI and here on GII.
The dataset we will use was combined and modified in the data wrangling part of this assignment.
We will start by loading the preprocessed "human" data into R and taking a look at its general appearance.
# load the data
human <- read.table(file = "human.csv", sep = ",", header = TRUE)
# glimpse at the dataset
glimpse(human)
## Observations: 155
## Variables: 8
## $ Edu2.FM <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.969060...
## $ Labo.FM <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.828611...
## $ Life.Exp <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0...
## $ Edu.Exp <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9...
## $ GNI <int> 166, 135, 156, 139, 140, 137, 127, 154, 134, 117, 18...
## $ Mat.Mor <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 1...
## $ Ado.Birth <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3...
## $ Parli.F <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2...
From glimpsing the data we can see that it contains 155 observations on 8 variables, all of which are numerical. Every row in the data is associated with one country, so we have observations from 155 different countries. From the web pages linked above we can see that most of the variables are observed for the year 2015, but the variable measuring secondary education covers the years 2005-2015.
Variable | Definition |
---|---|
Edu2.FM | The ratio between the shares of female and male population with at least some secondary education |
Labo.FM | The ratio between the shares of female and male labour force participation rate |
Edu.Exp | Expected years of schooling, years |
Life.Exp | Life expectancy at birth, years |
GNI | Gross national income (GNI) per capita, 2011 PPP $ |
Mat.Mor | Maternal mortality ratio, deaths per 100,000 live births |
Ado.Birth | Adolescent birth rate, births per 1,000 women ages 15-19 |
Parli.F | Share of seats in parliament, % held by women |
From the summary of the variables we can see minimum, maximum, median and mean values as well as the 1st and 3rd quartiles of the variables. By looking at the minimum and maximum values we can see that the countries in our data set differ quite a lot with respect to all eight variables.
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 2.00 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 53.50 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 99.00 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 98.73 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.:143.50 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :194.00 Max. :1100.0 Max. :204.80 Max. :57.50
Now, in order to gain an overview of the distributions of the variables and the relationships between them, we can draw a ggpairs plot. This plot contains so much information! We can, for example, immediately see that most of the countries in our sample exhibit very low rates of maternal mortality. Only the distribution of expected years of education appears to be close to a normal distribution; the distributions of the other variables are clearly skewed or have high kurtosis, such as the distribution of Edu2.FM.
ggpairs(human)
From the plot above we can already study the correlations between the variables, but we can also do this another way: with the help of a correlation matrix and a correlation plot.
# First calculate the correlation matrix and round it so that it includes only two digits:
cor_matrix<-cor(human) %>% round(digits = 2)
# Print the correlation matrix:
cor_matrix
## Edu2.FM Labo.FM Life.Exp Edu.Exp GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM 1.00 0.01 0.58 0.59 0.18 -0.66 -0.53 0.08
## Labo.FM 0.01 1.00 -0.14 0.05 -0.01 0.24 0.12 0.25
## Life.Exp 0.58 -0.14 1.00 0.79 0.19 -0.86 -0.73 0.17
## Edu.Exp 0.59 0.05 0.79 1.00 0.12 -0.74 -0.70 0.21
## GNI 0.18 -0.01 0.19 0.12 1.00 -0.10 -0.13 0.00
## Mat.Mor -0.66 0.24 -0.86 -0.74 -0.10 1.00 0.76 -0.09
## Ado.Birth -0.53 0.12 -0.73 -0.70 -0.13 0.76 1.00 -0.07
## Parli.F 0.08 0.25 0.17 0.21 0.00 -0.09 -0.07 1.00
# Visualize the correlation matrix with a correlations plot:
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
From the plot above, we can clearly and easily see the correlations between the variables. Note for example that expected years of education is highly positively correlated with life expectancy; this is most probably related to the fact that both are positively correlated with the countries' GNI. Maternal mortality and the adolescent birth rate are also strongly positively correlated, but here the undertone is completely the opposite and of a more sinister nature. In those countries where adolescent girls are more likely to become mothers, i.e. where they don't have access to contraceptives or are forced to marry young, expectant mothers are also less likely to have access to good-quality health care. These countries are likely to be low-income countries, which the correlation plot also confirms. All in all, it's clear from the plot that high maternal mortality and a high adolescent birth rate are signals of other problems in a country as well: those countries also have lower education levels and life expectancies, less income per capita and less gender equality in education. Interestingly, the share of women in parliament and the ratio of labour force participation between the sexes appear not to be strongly correlated with the other variables.
In this section we will perform a principal component analysis (PCA) using the singular value decomposition (SVD), which is the most important tool for reducing the number of dimensions in multivariate data. Crucially, we will first perform PCA on non-standardized data. Recall from part 2 that the values taken by the variables differ greatly in magnitude; especially GNI takes very large values compared to the other variables. We will see why standardizing the data is so important when using PCA.
Recall that in PCA the first principal component captures the maximum amount of variance of the features in the original data, the second PC captures the maximum amount of the variability left after the first PC, the third captures the maximum amount of variability left after the first two, and so on. All principal components are uncorrelated, and each successive component is less important in terms of captured variance.
By printing the results of the PCA we can see that the standard deviations (i.e. the variability) of the principal components are very different in magnitude. From the summary of the PCA results and the percentages of variance captured by each PC we can see that almost all of the variability in the original features is captured by the first principal component. A good hunch is that we will run into trouble when plotting the biplot.
# perform principal component analysis (with the SVD method) and print out the results
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 214.3202186 54.4858892 26.3814123 11.4791149 4.0668732 1.6067062
## [7] 0.1905111 0.1586732
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Edu2.FM -0.0007468424 4.707363e-04 -0.0001689810 -0.0004252438
## Labo.FM 0.0002210928 5.783875e-05 -0.0007563442 -0.0047679411
## Life.Exp -0.0334485095 1.573401e-02 -0.0303860782 -0.0780112042
## Edu.Exp -0.0098109042 2.441895e-03 -0.0223269972 -0.0369246053
## GNI -0.0278166075 9.979413e-01 0.0557767706 -0.0002740466
## Mat.Mor 0.9879825869 3.620190e-02 -0.1474236772 -0.0068326906
## Ado.Birth 0.1479114018 -5.046353e-02 0.9867754042 -0.0069374879
## Parli.F -0.0048164618 -1.494485e-03 -0.0026652825 -0.9962093264
## PC5 PC6 PC7 PC8
## Edu2.FM 0.0002964017 -0.0254740423 6.859097e-01 7.272399e-01
## Labo.FM -0.0034895150 -0.0320024527 7.265303e-01 -6.863629e-01
## Life.Exp -0.9759219723 0.1979058277 5.461806e-03 2.081465e-03
## Edu.Exp -0.1947276781 -0.9790080609 -4.054956e-02 3.992920e-03
## GNI 0.0150268685 -0.0014692010 -3.631899e-04 -3.767644e-04
## Mat.Mor -0.0282554922 -0.0005722709 1.213189e-04 8.299757e-04
## Ado.Birth -0.0393039689 -0.0160315111 -4.359137e-05 -9.463203e-05
## Parli.F 0.0841200882 0.0210694313 -2.695182e-03 2.658636e-03
# create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components%s:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 214.3202 54.48589 26.38141 11.47911 4.06687 1.60671
## Proportion of Variance 0.9233 0.05967 0.01399 0.00265 0.00033 0.00005
## Cumulative Proportion 0.9233 0.98298 0.99697 0.99961 0.99995 1.00000
## PC7 PC8
## Standard deviation 0.1905 0.1587
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion 1.0000 1.0000
# calculate rounded percentages of variance captured by each PC and print out the percentages of variance
pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 92.3 6.0 1.4 0.3 0.0 0.0 0.0 0.0
This is indeed the case. Recall that a biplot combines a scatter plot of the observations on the first two PCs with arrows representing the original features and their relationships with each other and with the principal components. As essentially all variability is captured by PC1 and hardly any by PC2, the relationships between the original features and the principal components do not really show. Moreover, since the length of the arrows in the biplot is proportional to the standard deviations of the original features, we essentially only see an arrow for GNI, which takes values many times larger than the other variables. Summa summarum, there is not much we can learn from this plot.
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "Biplot of the two principal components of the unscaled human data")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Next we will perform a similar analysis on the standardized data.
We start by standardizing the data, which is easily done with the scale() function. Recall that standardizing a variable means subtracting the mean of the variable from each observation and dividing the result by the standard deviation of the variable. From the summary of the standardized variables we can see that the values of the different variables are now much more in line with each other than before.
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-1.767729 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.826563 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median : 0.004952 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.000000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.818192 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 1.741082 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
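As a quick sanity check (a sketch, not part of the original workflow), we can verify that scale() really does just subtract the column means and divide by the column standard deviations:
# sanity check: manual standardization should match scale()
h <- as.matrix(human)
manual_std <- sweep(h, 2, colMeans(h), "-")               # subtract column means
manual_std <- sweep(manual_std, 2, apply(h, 2, sd), "/")  # divide by column sds
all.equal(manual_std, human_std, check.attributes = FALSE)  # should return TRUE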
Now we can perform the principal component analysis again and get more sensible results. From the printed results we can immediately see that the standard deviations of the principal components are now comparable in magnitude, and the variability captured by each PC is distributed much more evenly. Still, it is clear that PC1 is the most important component, as it captures 48.3 % of the variance in the original features.
# perform principal component analysis (with the SVD method) and print out the results
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 1.9658091 1.1387842 0.9899881 0.8659777 0.6993077 0.5400078 0.4670082
## [8] 0.3317153
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.38438136 0.04973416 -0.08668975 0.31301506 -0.82578999
## Labo.FM 0.06452258 0.72023430 -0.10293459 0.59483194 0.20499428
## Life.Exp -0.46758272 -0.01554845 0.02106889 -0.09983270 0.19702138
## Edu.Exp -0.44611813 0.14839516 0.06836143 0.09857478 0.22718207
## GNI -0.11121526 -0.01842917 -0.97298099 -0.16503899 0.07296310
## Mat.Mor 0.47036562 0.12782929 -0.11999280 0.04185463 0.04609075
## Ado.Birth 0.43435393 0.06675352 -0.06553107 -0.05173116 -0.38364001
## Parli.F -0.09031518 0.65984104 0.10671258 -0.70487388 -0.17604400
## PC6 PC7 PC8
## Edu2.FM 0.139911426 0.100381612 0.18084014
## Labo.FM 0.008135958 -0.256990684 -0.06742229
## Life.Exp -0.393884789 -0.421527811 0.63171658
## Edu.Exp -0.415647904 0.716119209 -0.16542557
## GNI 0.005876052 0.009858605 -0.08891751
## Mat.Mor 0.085633294 0.472803692 0.71642517
## Ado.Birth -0.792816131 -0.096488804 -0.12191407
## Parli.F 0.128550324 0.020403204 -0.01688909
# create and print out a summary of pca_human
s_std <- summary(pca_human_std)
s_std
## Importance of components%s:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.966 1.1388 0.9900 0.86598 0.69931 0.54001 0.46701
## Proportion of Variance 0.483 0.1621 0.1225 0.09374 0.06113 0.03645 0.02726
## Cumulative Proportion 0.483 0.6452 0.7677 0.86140 0.92253 0.95898 0.98625
## PC8
## Standard deviation 0.33172
## Proportion of Variance 0.01375
## Cumulative Proportion 1.00000
# calculate rounded percentages of variance captured by each PC and print out the percentages of variance
pca_pr_std <- round(100*s_std$importance[2,], digits = 1)
pca_pr_std
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 48.3 16.2 12.3 9.4 6.1 3.6 2.7 1.4
Let’s see how our results now look in the biplot.
# create object pc_lab to be used as axis labels
pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")
# draw a biplot
biplot(pca_human_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_std[1], ylab = pc_lab_std[2], main = "Biplot of the two principal components of the scaled human data")
From the two biplots above we can clearly see that standardizing the variables has a big effect on the results. The first biplot with unscaled data gave us very little information on the original features. Recall that PCA is sensitive to the relative scaling of the original features and assumes that features with larger variance are more important than features with smaller variance. This is why standardizing the data is so important.
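The point can be made concrete with a small optional check (assuming the unscaled human data from above): the column variances of the original features differ by orders of magnitude, which is exactly what drives the unscaled PCA.
# optional check: column variances of the unscaled features
round(apply(human, 2, var), 2)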
Our data set has observations on 8 different variables, that is, it has 8 dimensions. Just by looking at all the different panels in, say, the ggpairs plot above, it could be very hard to tease out interesting details in the data. Not that much stands out when you have a heapload of scatterplots! But with PCA and a biplot we can, for example, note that there is something different about Rwanda, which is a lonely outlier in our plot. We can also immediately see that three of our eight variables (Parli.F, Labo.FM and GNI) are not correlated with the other variables or with the first PC; instead they align more with the second PC (the angle between their arrows and the PC2 axis is small). The five other variables are correlated with the first PC.
From the second biplot we can clearly see a confirmation of what we saw in the correlation plot: the share of female parliamentary representation, the difference in female and male labor force participation, and GNI per capita are not really correlated with the other five variables. Also note that the arrow of GNI is much shorter than the arrows of the others (recall that the length of an arrow reflects how strongly that feature is represented by the first two components).
We can also see that these three variables are more closely aligned with the second PC, whereas the other five are correlated with the first PC. Also note that the arrows of these five variables are in general longer than those of the three aligned with PC2, which reflects how much of their variation the first two components capture.
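As an optional numeric counterpart to the arrows (a sketch using the pca_human_std object from above), we can compute the correlations between the standardized variables and the first two principal component scores:
# optional check: correlations of the standardized variables with the PC scores
round(cor(human_std, pca_human_std$x[, 1:2]), 2)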
Load the tea dataset from the package FactoMineR. Explore the data briefly: look at the structure and the dimensions of the data and visualize it. Then do Multiple Correspondence Analysis on the tea data (or on certain columns of the data, it’s up to you). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.
Start by installing the FactoMineR package and accessing the library (code not shown here). Then load the tea dataset and take a glimpse at the data. We have 300 observations on 36 variables. Note that, apart from age (an integer), all the variables are factors, i.e. categorical variables, which is what MCA requires.
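For completeness, here is a sketch of what that hidden setup chunk might look like (the package choices are assumptions based on the functions used below):
# sketch of the setup step (not shown in the original report)
# install.packages("FactoMineR")   # run once if the package is missing
library(FactoMineR)   # MCA() and the tea dataset
library(dplyr)        # select(), glimpse()
library(tidyr)        # gather()
library(ggplot2)      # ggplot(), facet_wrap(), geom_bar()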
# load the data
data("tea")
# glimpse at the dataset
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast <fctr> breakfast, breakfast, Not.breakfast, Not.bre...
## $ tea.time <fctr> Not.tea time, Not.tea time, tea time, Not.te...
## $ evening <fctr> Not.evening, Not.evening, evening, Not.eveni...
## $ lunch <fctr> Not.lunch, Not.lunch, Not.lunch, Not.lunch, ...
## $ dinner <fctr> Not.dinner, Not.dinner, dinner, dinner, Not....
## $ always <fctr> Not.always, Not.always, Not.always, Not.alwa...
## $ home <fctr> home, home, home, home, home, home, home, ho...
## $ work <fctr> Not.work, Not.work, work, Not.work, Not.work...
## $ tearoom <fctr> Not.tearoom, Not.tearoom, Not.tearoom, Not.t...
## $ friends <fctr> Not.friends, Not.friends, friends, Not.frien...
## $ resto <fctr> Not.resto, Not.resto, resto, Not.resto, Not....
## $ pub <fctr> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,...
## $ Tea <fctr> black, black, Earl Grey, Earl Grey, Earl Gre...
## $ How <fctr> alone, milk, alone, alone, alone, alone, alo...
## $ sugar <fctr> sugar, No.sugar, No.sugar, sugar, No.sugar, ...
## $ how <fctr> tea bag, tea bag, tea bag, tea bag, tea bag,...
## $ where <fctr> chain store, chain store, chain store, chain...
## $ price <fctr> p_unknown, p_variable, p_variable, p_variabl...
## $ age <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex <fctr> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M,...
## $ SPC <fctr> middle, middle, other worker, student, emplo...
## $ Sport <fctr> sportsman, sportsman, sportsman, Not.sportsm...
## $ age_Q <fctr> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35...
## $ frequency <fctr> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, ...
## $ escape.exoticism <fctr> Not.escape-exoticism, escape-exoticism, Not....
## $ spirituality <fctr> Not.spirituality, Not.spirituality, Not.spir...
## $ healthy <fctr> healthy, healthy, healthy, healthy, Not.heal...
## $ diuretic <fctr> Not.diuretic, diuretic, diuretic, Not.diuret...
## $ friendliness <fctr> Not.friendliness, Not.friendliness, friendli...
## $ iron.absorption <fctr> Not.iron absorption, Not.iron absorption, No...
## $ feminine <fctr> Not.feminine, Not.feminine, Not.feminine, No...
## $ sophisticated <fctr> Not.sophisticated, Not.sophisticated, Not.so...
## $ slimming <fctr> No.slimming, No.slimming, No.slimming, No.sl...
## $ exciting <fctr> No.exciting, exciting, No.exciting, No.excit...
## $ relaxing <fctr> No.relaxing, No.relaxing, relaxing, relaxing...
## $ effect.on.health <fctr> No.effect on health, No.effect on health, No...
We will follow the DataCamp exercises and restrict our analysis to six columns of the dataset. We are thus left with 300 observations on 6 variables.
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
glimpse(tea_time)
## Observations: 300
## Variables: 6
## $ Tea <fctr> black, black, Earl Grey, Earl Grey, Earl Grey, Earl Gre...
## $ How <fctr> alone, milk, alone, alone, alone, alone, alone, milk, m...
## $ how <fctr> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag, t...
## $ sugar <fctr> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar, N...
## $ where <fctr> chain store, chain store, chain store, chain store, cha...
## $ lunch <fctr> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch, ...
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
We can visualize the data using bar plots. From the plots we can for example note that most tea drinkers surveyed here completely miss the point of drinking tea: Earl Grey from a tea bag bought from a chain store is basically an insult to tea.
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Next we want to do a Multiple Correspondence Analysis (MCA) on the tea data. Recall that MCA is a method for analyzing qualitative data; it can be used to detect patterns or structure in the data as well as for dimension reduction.
MCA builds on the matrix of two-way cross-tabulations between all the variables in the dataset and presents the information in those cross-tabulations in a compact graphical form. In the MCA factor maps the variable categories are drawn on the first two dimensions discovered by the analysis; when interpreting the graphical results, recall that the distance between the variable categories gives a measure of their similarity.
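To make this concrete, here is one such cross-tabulation as an illustrative sketch (not part of the MCA output itself), between where the tea is bought and in which form it is bought:
# illustrative sketch: one of the two-way cross-tabulations summarized by MCA
table(tea_time$where, tea_time$how)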
The default plots in a sense group the observations and give some indication of their relationships. In the second plot all the individual observations are plotted; in the first plot they are grouped according to the category of the observation; and in the third plot the picture is simplified even further by combining all the observations on a single variable into one point. From the default plots we can e.g. see that the variables “where” (where the tea is bought) and “how” (whether the tea is bought in bags or unpackaged) are closely related, which is not surprising. These two variables are, however, distinct from the other four variables.
# multiple correspondence analysis
mca <- MCA(tea_time)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
Next we can make the first of the above plots a bit more informative by adding the argument habillage = “quali”. This argument colors the category points so that categories belonging to the same variable share a color. From this plot we can easily see, for example, the relationship between where the tea is bought and in what form it is bought.
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")