Chapter 1: Tools and methods for open and reproducible research

RStudio Exercise 1: About the project

Write a short description about the course and add a link to your github repository here. This is an R markdown (.Rmd) file so you can use R markdown syntax. See the ‘Useful links’ page in the mooc area (chapter 1) for instructions.

Introduction to Open Data Science

This course seems like a lot of fun.

My GitHub repository

My GitHub repository can be found here.


Chapter 2: Regression and model validation

RStudio exercise 2: Data wrangling and analysis

Data wrangling

You can find my preprocessed data set and the data wrangling R script in my GitHub repository.

Analysis

1. Data

The learning2014 dataset was collected as part of an international survey of approaches to learning (you can find more information on it here) during 3.12.2014 - 10.1.2015. It contains 166 observations on 7 variables.

learning2014 <- read.table("learning2014.csv", header = TRUE, sep = "\t")
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The variables are gender, age, attitude, deep, stra, surf and points.

Variable   Definition                                            Value
age        age (in years) derived from the date of birth         integer
gender     gender                                                M (Male), F (Female)
attitude   global attitude toward statistics                     1 to 5 on the Likert scale
deep       mean of the questions related to deep learning        1 to 5 (mean of Likert-scale items)
stra       mean of the questions related to strategic learning   1 to 5 (mean of Likert-scale items)
surf       mean of the questions related to surface learning     1 to 5 (mean of Likert-scale items)
points     exam points                                           integer

2. A more detailed view on the data

A graphical overview:

It seems like a fairly plausible assumption that a student’s attitude towards statistics has a big impact on exam success. We can start by plotting the variables attitude and points as a scatter plot with a color indicator for gender and regression lines by gender. (I use the chunk option include=FALSE when loading the libraries to keep R’s startup messages out of the output.)
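The setup chunk itself is not shown in the report; a minimal sketch of what it might contain (assuming only the ggplot2 and GGally packages used in this chapter) is:

# load the plotting packages used in this chapter; in the .Rmd this chunk
# uses include=FALSE so that package startup messages are hidden
library(ggplot2)
library(GGally)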

# initialize plot with data and aesthetic mapping, define the visualization type (points), add a regression line and add a main title:
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender)) + geom_point() + geom_smooth(method = "lm")  + ggtitle("Student's attitude versus exam points")

# draw the plot
p1

Next, let’s consider a more detailed plot describing the variables, their distributions and their relationships. This plot contains a wealth of information! First, we can in fact see that the hypothesis about the relation between attitude and exam points was plausible, as the correlation between points and attitude is indeed stronger than the correlation between points and any other variable.

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

Summary of the variables

From the summary of the data we can see the minimum, maximum, median and mean values as well as the 1st and 3rd quartiles of all the variables. One can note that the sample is biased towards female respondents, that the respondents are mostly in their early twenties, and that they have a generally more positive than negative view of statistics as a discipline. The median and mean values of the learning variables deep, stra and surf show that the respondents are more likely to engage in deep learning than in surface learning.

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

3. A regression model

Next let’s see whether some of the hypotheses made above hold up against a bit more sophisticated analysis. We will consider exam points as the dependent variable, and the aim is to fit a regression model to the data that explains what affects the number of exam points.

From the plots above we can see that attitude, stra and surf have the highest correlations with points in absolute terms. Let’s therefore choose these as our explanatory variables and fit the following linear model with multiple explanatory variables:

# create a regression model with multiple explanatory variables
points_mmodel <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(points_mmodel)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

From the summary of the model we can read off the results. Under “Call” we see a recap of the model. Under “Residuals” the summary prints important information on the residuals. Recall that the residuals of a linear regression model are assumed to be normally distributed with zero mean and constant variance. The median of the residuals is quite close to zero; the mean is exactly zero by construction when an intercept is included, so the more interesting checks concern the normality and constant variance of the residuals, which we return to with the diagnostic plots below.
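A quick numerical check of these residual properties can be done as follows (a supplementary sketch, not part of the original output):

# the mean of the residuals is zero by construction when an intercept is included
mean(residuals(points_mmodel))

# a Shapiro-Wilk test gives a formal check of the normality assumption
shapiro.test(residuals(points_mmodel))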

Under “Coefficients” we have the estimated effects of the explanatory variables on the dependent variable, their standard errors, t-values and p-values. The last two are perhaps the most important, as they indicate the statistical significance of the estimates. From the printout we can immediately see that only the intercept and the effect of attitude are statistically significantly different from zero. The effects of stra and surf are not, and these variables should be removed from the model.

Let’s then fit the model again without the two statistically insignificant variables, that is, as a simple linear regression:

# create a simple regression model with a single explanatory variable
points_model <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(points_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now we can see that the median of the residuals is closer to zero than in the multiple regression, which is a good sign. We can also see that both the intercept and the effect of attitude are now highly statistically significant, whereas before, when the other two variables were included, the intercept was slightly less significant.

4. Interpreting the regression results

From the summary of our simple linear regression model we can see that the estimated effect of attitude on points is 3.5255. This means that a one-unit increase in attitude corresponds to an expected increase of 3.5255 exam points. More concretely, a student with an attitude score of 4 is expected to score 3.5255 points more on the exam than a student with an attitude score of 3. This result clearly supports the hypothesis made in part 2 that attitude towards statistics is an important predictor of success in the exam.
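To make the interpretation concrete, we can compare the model’s predictions for two hypothetical attitude scores (a supplementary sketch, not part of the original output):

# predicted exam points for attitude scores 3 and 4;
# the difference equals the estimated slope, about 3.53 points
preds <- predict(points_model, newdata = data.frame(attitude = c(3, 4)))
preds
diff(preds)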

The “Multiple R-squared” printed as part of the summary gives the coefficient of determination, i.e. the proportion of the variance in the dependent variable that is predicted by the model. Because our model includes an intercept, R-squared equals the square of the sample correlation between the observed outcomes and the fitted values. The value of R-squared in our model is 0.1906, which means that 19.06 % of the variance of the dependent variable is explained by our model. The remaining variance is due to factors not included in the model.
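This relationship between R-squared and the correlation of the observed and fitted values is easy to verify directly (a supplementary check):

# the squared correlation between observed points and fitted values
# should equal the multiple R-squared reported above (about 0.19)
cor(learning2014$points, fitted(points_model))^2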

Notice that R-squared is higher in the multiple regression model, even though the additional explanatory variables are insignificant. This is because R-squared never decreases when explanatory variables are added to a model. To account for this, R also prints the “Adjusted R-squared”, which penalizes for the number of explanatory variables relative to the number of data points.

5. Diagnostic plots

When using a linear regression model, we make two important assumptions about the data-generating process:

  • the target variable is a linear combination of the model parameters
  • the errors are normally distributed with zero mean and a constant variance

The second assumption also implies that the errors should not be correlated and that the size of a given error should not depend on the explanatory variables.

We can study the validity of the model by plotting the following three diagnostic plots: Residuals vs fitted values, Normal QQ-plot and Residuals vs. leverage.

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(points_model, which = c(1,2,5))

The assumption of constant error variance can be assessed with a simple scatter plot of residuals vs. model predictions (fitted values). Any pattern in this plot implies a problem. Our plot appears to show no pattern, so we can accept the constant-variance assumption with reasonable confidence.

The normal QQ-plot allows us to study the normality of the errors. The plotted values should fall reasonably close to the straight line, and we can see from the plot that this is indeed the case.

From the residuals vs. leverage plot we can study the impact single observations have on the model. The third plot shows that no single observation stands out.

Thus, with the help of these diagnostic plots, we can conclude that the assumptions of the linear regression model hold reasonably well for our data.


Chapter 3: Logistic regression

RStudio exercise 3: Data wrangling and analysis

Data wrangling

You can find my preprocessed data set and the data wrangling R script in my GitHub repository.

Analysis

2. Data

The alc data set contains data on student alcohol consumption; it was joined from two data sets included in the Student Performance Data. You can learn more about the data here.

The data contains the combined results from two surveys conducted on students in mathematics and Portuguese classes. The data sets were merged in the data wrangling part of this exercise, and the variables not used for joining were combined by averaging. Because we are interested in studying the alcohol use of students, we have defined two new variables:

  • “alc_use”, which is the average of “Dalc” and “Walc”, i.e. alcohol use during working days and during weekends; it takes values from 1 to 5 in steps of 0.5 (a sketch of how these variables can be computed is shown after this list).
  • “high_use”, which is TRUE if “alc_use” is greater than 2 and FALSE otherwise. That is, it is a dichotomous variable.
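A minimal sketch of how these two variables can be computed with dplyr on the joined data frame (the actual computation is done in the data wrangling script linked above):

# average weekday and weekend alcohol use, then flag high consumption
library(dplyr)
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
alc <- mutate(alc, high_use = alc_use > 2)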

The data contains 382 observations on 35 variables. Some of the variables are numerical, some dichotomous, and there are also factor variables with other structures.

alc <- read.table("alc.csv", header = TRUE, sep = ";")
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

3. Hypothesis

Now we will present a few hypotheses about possible links between some variables in the data and alcohol consumption. First, as in the DataCamp exercises, it seems plausible to assume that sex, absences and failures are correlated with alcohol consumption. As a fourth candidate we will choose going out (goout), the idea being that students who go out partying a lot also end up consuming more alcohol.

Let’s state the hypotheses formally:

  1. Male students are more likely to consume more alcohol than female students.
  2. Students that consume more alcohol are more likely to gain absences than those who consume less.
  3. Students that consume more alcohol are more likely to fail courses than those who consume less.
  4. Students that go out more are more likely to consume more alcohol.

All four hypotheses posit a positive correlation.

4. The distributions of the chosen variables and their relationships with alcohol consumption

Now we want to get some idea of how the chosen variables are distributed and what their relationship with alcohol consumption could be.

Bar plots

Let’s start by taking a look at the distributions of the chosen variables and the two variables of interest. From the bar plots it is easy to see that in our sample alcohol consumption is strongly skewed towards low consumption, and that the number of students who use a lot of alcohol is less than half the number of students who use alcohol in moderation. From the plots we can also see that there are slightly more female than male respondents, that absences are strongly skewed towards little or no absences with a few outliers, that most students pass their courses, and that going out is the only variable whose distribution even remotely resembles a normal distribution.

ggplot(data = alc, aes(x = alc_use)) + geom_bar()

ggplot(data = alc, aes(x = high_use)) + geom_bar()

ggplot(data = alc, aes(x = sex)) + geom_bar()

ggplot(data = alc, aes(x = absences)) + geom_bar()

ggplot(data = alc, aes(x = failures)) + geom_bar()

ggplot(data = alc, aes(x = goout)) + geom_bar()

We can also easily draw a plot that emphasizes the differences in alcohol consumption between male and female students. The plot offers support for our first hypothesis.

g2 <- ggplot(data = alc, aes(x = high_use))
g2 + geom_bar() + facet_wrap("sex")

Cross-tabulations

First, we ignore gender and cross-tabulate alcohol consumption against the means of the other three variables. We can see that the means of all three variables appear to show an upward-sloping trend as alcohol consumption increases. This is indeed what we would expect on the basis of our hypotheses.

alc %>% group_by(alc_use) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 9 x 5
##   alc_use count mean_absences mean_failures mean_goout
##     <dbl> <int>         <dbl>         <dbl>      <dbl>
## 1     1.0   140      3.357143     0.1071429   2.750000
## 2     1.5    69      4.231884     0.1449275   2.869565
## 3     2.0    59      3.915254     0.2203390   3.084746
## 4     2.5    44      6.431818     0.1363636   3.295455
## 5     3.0    32      6.093750     0.5625000   4.000000
## 6     3.5    17      5.647059     0.4705882   3.764706
## 7     4.0     9      6.000000     0.2222222   4.222222
## 8     4.5     3     12.000000     0.0000000   3.666667
## 9     5.0     9      6.888889     0.5555556   4.222222

If we include sex in the grouping variables, the trends in the means become somewhat more pronounced.

alc %>% group_by(alc_use, sex) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 17 x 6
## # Groups:   alc_use [?]
##    alc_use    sex count mean_absences mean_failures mean_goout
##      <dbl> <fctr> <int>         <dbl>         <dbl>      <dbl>
##  1     1.0      F    87      3.781609    0.10344828   2.896552
##  2     1.0      M    53      2.660377    0.11320755   2.509434
##  3     1.5      F    42      4.642857    0.04761905   2.857143
##  4     1.5      M    27      3.592593    0.29629630   2.888889
##  5     2.0      F    27      5.000000    0.25925926   3.296296
##  6     2.0      M    32      3.000000    0.18750000   2.906250
##  7     2.5      F    26      6.576923    0.19230769   2.961538
##  8     2.5      M    18      6.222222    0.05555556   3.777778
##  9     3.0      F    11      7.636364    0.54545455   4.090909
## 10     3.0      M    21      5.285714    0.57142857   3.952381
## 11     3.5      F     3      8.000000    0.33333333   3.333333
## 12     3.5      M    14      5.142857    0.50000000   3.857143
## 13     4.0      F     1      3.000000    0.00000000   4.000000
## 14     4.0      M     8      6.375000    0.25000000   4.250000
## 15     4.5      M     3     12.000000    0.00000000   3.666667
## 16     5.0      F     1      3.000000    0.00000000   5.000000
## 17     5.0      M     8      7.375000    0.62500000   4.125000

We can also do the cross-tabulations with “high_use” as the variable of interest. Here the differences between the means for students who are heavy users of alcohol and those who are not are quite evident.

alc %>% group_by(high_use, sex) %>% summarise(count = n(), mean_absences = mean(absences), mean_failures = mean(failures), mean_goout = mean(goout))
## # A tibble: 4 x 6
## # Groups:   high_use [?]
##   high_use    sex count mean_absences mean_failures mean_goout
##      <lgl> <fctr> <int>         <dbl>         <dbl>      <dbl>
## 1    FALSE      F   156      4.224359     0.1153846   2.955128
## 2    FALSE      M   112      2.982143     0.1785714   2.714286
## 3     TRUE      F    42      6.785714     0.2857143   3.357143
## 4     TRUE      M    72      6.125000     0.3750000   3.930556

Box plots

We will next draw box plots of absences, failures and goout grouped by “high_use”, using “sex” as a color. From the box plots we can see that hypotheses 1, 2 and 4 seem plausible: high alcohol consumption appears to be associated with the respondent being male and with relatively more absences and more going out. However, the box plot for failures is a bit perplexing; perhaps because the variable is so heavily skewed towards zero, the plot appears to show only a few outliers that differ from zero.

g1 <- ggplot(alc, aes(x = high_use, col = sex, y = absences))
g1 + geom_boxplot() + ylab("absences")

g2 <- ggplot(alc, aes(x = high_use, col = sex, y = failures))
g2 + geom_boxplot() + ylab("failures")

g3 <- ggplot(alc, aes(x = high_use, col = sex, y = goout))
g3 + geom_boxplot() + ylab("goout")

5. Logistic regression

In this part we will use logistic regression to gain more solid knowledge of the effects our chosen variables may have on the target variable, which here is the binary high/low alcohol consumption indicator.

# find the model with glm()
m1 <- glm(high_use ~ sex + absences + failures + goout, data = alc, family = "binomial")

From the summary of the model we can immediately see that “failures” is indeed not statistically significant, which is in line with the box plot. Other than that, the results of the regression are in line with our hypotheses: male students are more likely to be heavy users of alcohol, as are those with more absences and those who go out more.

# print out a summary of the model
summary(m1)
## 
## Call:
## glm(formula = high_use ~ sex + absences + failures + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0299  -0.7890  -0.5417   0.7857   2.3588  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.13390    0.47292  -8.741  < 2e-16 ***
## sexM         0.91999    0.25625   3.590 0.000330 ***
## absences     0.08243    0.02235   3.688 0.000226 ***
## failures     0.34560    0.21093   1.638 0.101330    
## goout        0.70797    0.11998   5.901 3.62e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 385.06  on 377  degrees of freedom
## AIC: 395.06
## 
## Number of Fisher Scoring iterations: 4

Odds ratios and confidence intervals

Next we will present and interpret the coefficients of the model as odds ratios and calculate their confidence intervals.

Recall that the odds ratio is defined as the ratio of the odds of Y = TRUE for individuals with property X to the odds of Y = TRUE for individuals without property X. Here Y is high consumption of alcohol and X is, for example, the respondent being male. In the ratios we have computed, the odds ratio for the variable sexM thus gives the ratio of the odds of a male being a heavy drinker to the odds of a female being a heavy drinker. We can see that the ratio is more than 2.5: the odds of a male student being a heavy drinker are more than 2.5 times the odds of a female student being one.

As we are working with a logistic regression, we can interpret the odds ratio of a non-binary variable as the odds ratio between a one-unit change and no change in that explanatory variable. Thus the odds ratio of “goout”, about 2.0, means that going out one unit more roughly doubles the odds of being a heavy drinker.

Note also that as all of the odds ratios calculated are higher than one, all our chosen variables are positively associated with high alcohol consumption.

From the confidence intervals we can see that the intervals of all the statistically significant variables exclude one, but the interval for “failures” does not. That is, the odds ratio of “failures” is not statistically different from one, so its effect on the likelihood of high alcohol consumption is not statistically significant.

# compute odds ratios (OR)
OR1 <- coef(m1) %>% exp

# compute confidence intervals (CI)
CI1 <- confint(m1) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR1, CI1)
##                    OR1       2.5 %     97.5 %
## (Intercept) 0.01602022 0.006086805 0.03902357
## sexM        2.50925591 1.526894419 4.17886012
## absences    1.08592493 1.040580166 1.13727912
## failures    1.41283378 0.934759890 2.14797583
## goout       2.02986872 1.613471240 2.58531654

6. Predictions

In this part we will explore the predictive power of a model that includes only the variables with a statistically significant relationship with high alcohol consumption. Thus we will drop failures and fit a new model:

m2 <- glm(high_use ~ sex + absences + goout, data = alc, family = "binomial")

summary(m2)
## 
## Call:
## glm(formula = high_use ~ sex + absences + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7871  -0.8153  -0.5446   0.8357   2.4740  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.16317    0.47506  -8.764  < 2e-16 ***
## sexM         0.95872    0.25459   3.766 0.000166 ***
## absences     0.08418    0.02237   3.764 0.000167 ***
## goout        0.72981    0.11970   6.097 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.75  on 378  degrees of freedom
## AIC: 395.75
## 
## Number of Fisher Scoring iterations: 4

Now we will use the predict() function to make predictions from this model. We start by adding two columns to our data set: “probability” with the predicted probabilities, and “prediction”, which takes the value TRUE if the value of “probability” is larger than 0.5.

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, sex, goout, high_use, probability, prediction) %>% tail(10)
##     absences sex goout high_use probability prediction
## 373        0   M     2    FALSE  0.14869987      FALSE
## 374        7   M     3     TRUE  0.39514446      FALSE
## 375        1   F     3    FALSE  0.13129452      FALSE
## 376        6   F     3    FALSE  0.18714923      FALSE
## 377        2   F     2    FALSE  0.07342805      FALSE
## 378        2   F     4    FALSE  0.25434555      FALSE
## 379        2   F     2    FALSE  0.07342805      FALSE
## 380        3   F     1    FALSE  0.03989428      FALSE
## 381        4   M     5     TRUE  0.68596604       TRUE
## 382        2   M     1     TRUE  0.09060457      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253   15
##    TRUE     65   49
# tabulate the target variable versus the predictions with margins:
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000

From the 2x2 cross-tabulation of predictions versus the actual values we can see that our model is not perfect: in 15 cases it predicted a respondent to be a heavy user when in fact they were not, and in 65 cases it predicted the opposite. On the other hand, in 302 cases the prediction was correct. From the second table we can see the shares of correct and incorrect predictions: in about 21 % of the cases the prediction was incorrect, which seems like a reasonable fit.
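The same shares can also be computed directly from the prediction column (a supplementary check):

# proportions of correct and incorrect predictions
mean(alc$high_use == alc$prediction)
mean(alc$high_use != alc$prediction)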

We can also see this in a graph that plots the true values against the predictions:

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

We still want to compute the total proportion of inaccurately classified individuals, i.e. the training error. For this purpose we will define a loss function and then calculate the average proportion of wrong predictions in the training data.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7015707
loss_func(class = alc$high_use, alc$probability)
## [1] 0.2094241

That is, the training error is approximately 0.21, or 21 %. This seems like a reasonable fit.

7. Cross-validation

Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a loss function is computed on data not used for finding the model. Cross-validation gives a good sense of the actual predictive power of the model.

# 10-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average proportion of wrong predictions in the cross-validation
cv$delta[1]
## [1] 0.2172775

The average proportion of wrong predictions in the cross-validation is about 0.22, which is smaller than the 0.26 error of the model in the DataCamp exercises. Thus my model has slightly better test-set performance than the example model.


Chapter 4: Clustering and classification

RStudio exercise 4: Data analysis and wrangling

Data analysis

2. Data

The Boston dataset contains data from a study on housing values in the suburbs of Boston. With the data one can, for example, study the effect of crime rates or air quality on the value of owner-occupied homes. You can learn more about the data here.

The dataset contains 506 observations on 14 variables. All variables are numeric: most take continuous (real) values, some take integer values, and one (chas) is dichotomous.

# load the data
data("Boston")

# glimpse at the dataset
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

3. Overview of the data

Next we will look at the data in a bit more detail. Start by printing out the column/variable names and writing down the definitions of the different variables (from here):

colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
Variable   Definition
crim       per capita crime rate by town
zn         proportion of residential land zoned for lots over 25,000 sq.ft.
indus      proportion of non-retail business acres per town
chas       Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox        nitrogen oxides concentration (parts per 10 million)
rm         average number of rooms per dwelling
age        proportion of owner-occupied units built prior to 1940
dis        weighted mean of distances to five Boston employment centres
rad        index of accessibility to radial highways
tax        full-value property-tax rate per $10,000
ptratio    pupil-teacher ratio by town
black      1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat      lower status of the population (percent)
medv       median value of owner-occupied homes in $1000s

From the summary of the variables we can see minimum, maximum, median and mean values as well as the 1st and 3rd quartiles of the variables.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We can study and visualize the correlations between the different variables with the help of a correlation matrix and a correlation plot.

# First calculate the correlation matrix and round it so that it includes only two digits:
cor_matrix<-cor(Boston) %>% round(digits = 2)

# Print the correlation matrix:
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# Visualize the correlation matrix with a correlations plot:
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

From the plot above we can easily see which variables correlate with which, and whether the correlation is positive or negative. For example, we can see that the distance to the employment centres (dis) correlates negatively with the age of the houses (age), the nitrogen oxides concentration (nox) and the proportion of non-retail business acres (indus). We can also see that accessibility to radial highways (rad) is highly positively correlated with property taxes (tax), as are industrial land use (indus) and air pollution (nox). At first glance the correlations appear quite intuitive and nothing stands out as very surprising. One could note that the only dummy variable, chas, has almost negligible correlations with all the other variables. This could be because the overall share of observations where chas = 1 is very small.

4. Standardizing and scaling the dataset

We will later use linear discriminant analysis on our data, which assumes that the variables are normally distributed and share the same covariance matrix across the different classes. This calls for scaling the data before fitting the model.

Thus we want to standardize the data set by scaling it. This is very easy to do in R with the scale() function, which subtracts the mean of each variable from its values and divides the result by the standard deviation of the variable. From the summary we can see that the standardized values are much more similar in magnitude across the different variables, so their relationships are easier to grasp. Note that the mean of each variable is now zero, as it should be. Also note that previously all values were positive, but now the minimum value of every variable is negative. This is because the mean of each variable is larger than its smallest value.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
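As a quick check of what scale() does, the standardization of a single column can be reproduced by hand (a supplementary sketch, not part of the original output):

# subtract the column mean and divide by the column standard deviation;
# the result should match the corresponding column of scale(Boston)
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(crim_manual, as.numeric(scale(Boston)[, "crim"]))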

Note that the scaled data set is now in matrix format. We want to change the class of the scaled data set into a data frame, which we can do with the function as.data.frame():

# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Next we want to create a categorical (factor) variable from the crime rate in the scaled Boston data set. We will use the quartiles as the break points of the categorical variable. We will print the summary of the scaled crime rate, its quantile vector and a table of the new factor variable to check that everything goes as intended.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

In order to avoid confusion, we want to drop the old crime rate variable from the dataset and replace it with the new categorical variable for crime rates. This is easily done with the following two lines of code:

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Lastly for this step, we will divide the dataset into train and test sets. This will allow us further down the line to check how well our model works. We will assign 80 % of the data to the train set and the remaining 20 % to the test set. The model is trained on the train set and predictions on new data are made on the test set.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

5. Linear discriminant analysis

Now we will fit a linear discriminant analysis on the train set. LDA is a classification method that finds the linear combinations of the variables that best separate the target variable classes. We will use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2673267 0.2326733 0.2574257 0.2425743 
## 
## Group means:
##                  zn      indus         chas        nox          rm
## low       0.8949519 -0.9062703 -0.126510540 -0.8639591  0.45121626
## med_low  -0.1002056 -0.3311926  0.062743293 -0.5786949 -0.13361340
## med_high -0.3874684  0.2113804  0.257665195  0.4054934  0.08666805
## high     -0.4872402  1.0149946  0.008892378  1.0154871 -0.45346741
##                 age        dis        rad        tax    ptratio
## low      -0.8661588  0.8290926 -0.6873107 -0.7344032 -0.4370892
## med_low  -0.3660405  0.3835351 -0.5432545 -0.4832759 -0.1057476
## med_high  0.4814412 -0.4204791 -0.4341408 -0.3260902 -0.3605327
## high      0.7900634 -0.8473921  1.6596029  1.5294129  0.8057784
##                black       lstat        medv
## low       0.37307325 -0.77176361  0.53243100
## med_low   0.31789570 -0.13216633  0.02523482
## med_high  0.07054389  0.04851603  0.17531453
## high     -0.85678592  0.87892067 -0.68679007
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.11064629  0.50851943 -1.0794251
## indus    0.05698601 -0.30744412  0.2072748
## chas    -0.05340348 -0.04092194  0.2014044
## nox      0.13397336 -0.77053254 -1.2072462
## rm      -0.14887690 -0.07977527 -0.1655379
## age      0.27211892 -0.47851907 -0.1870665
## dis     -0.14504293 -0.22011357  0.3343980
## rad      3.68193482  0.90265651 -0.2031073
## tax      0.09624445  0.08772477  0.6807715
## ptratio  0.12104699  0.05061285 -0.3051478
## black   -0.15168803  0.02189388  0.1130713
## lstat    0.18716883 -0.20425751  0.4713960
## medv     0.17434448 -0.33140258 -0.1022168
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9548 0.0348 0.0104

Using the trained model, LDA calculates the probability of a new observation belonging to each of the classes and then classifies the observation into the most probable class.
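These class probabilities can be inspected directly with predict(); a supplementary sketch for the first few training observations:

# posterior class probabilities and predicted classes for the train set
head(predict(lda.fit)$posterior)
head(predict(lda.fit)$class)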

The most convenient and informative way to inspect the results of an LDA is to visualize them with a biplot. Following the DataCamp exercise, this can be done with the following code:

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The arrows in this plot represent the relationships of the original variables and the LDA solution.

6. LDA predicting performance

In this part we want to see how the LDA model performs when predicting on new data. To this end we can use the predict() function. We start by predicting the crime classes with the test data.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13       6        0    0
##   med_low    8      12       12    0
##   med_high   3       4       13    2
##   high       0       0        1   28

From the cross-tabulation of the results we can see that the model predicts the correct classes reasonably well. It has the most trouble separating med_low from low, but most of the predictions fall in the correct class.
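The overall share of correct predictions can also be computed directly (a supplementary check):

# proportion of test observations classified into the correct crime category
mean(lda.pred$class == correct_classes)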

7. Clustering

Next we will take a few steps towards clustering the data. We start by reloading the Boston dataset and standardizing it again; this needs to be done because we have made other changes to the dataset along the way.

# load the data
data("Boston")

# center and standardize variables
boston_scaled2 <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled2)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled2)
## [1] "matrix"
# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled)

Next we wish to calculate the distances between the observations. We do this by forming a standard Euclidean distance matrix:

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
## Warning in dist(boston_scaled2): NAs introduced by coercion
# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1394  3.5267  4.9081  4.9394  6.2421 13.0045

I ran into trouble here: there are NA values in the distance matrix computed from boston_scaled2, and the k-means algorithm would not run on it, returning an error about the NA values. Despite googling, I was not able to find a solution at the time. In hindsight, the NAs most likely come from the conversion step above, where boston_scaled (which still contains the factor variable crime) was converted to a data frame instead of the freshly scaled boston_scaled2, so dist() coerces the factor column and introduces NAs. The rest of this exercise therefore uses the original Boston dataset, as in the DataCamp exercises.
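For reference, a minimal sketch of what the fix would probably look like, assuming the diagnosis above is correct (this was not run as part of the original analysis):

# convert the freshly scaled matrix itself, so the factor column does not sneak in
boston_scaled2 <- as.data.frame(scale(Boston))

# the Euclidean distance matrix should now contain no NAs
dist_eu2 <- dist(boston_scaled2)
summary(dist_eu2)

Returning to the workaround actually used, here is a new attempt with the original, unscaled Boston data: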

# euclidean distance matrix
dist_eu <- dist(Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.624 170.539 226.315 371.950 626.047

From the summary we can see the main summary statistics of the calculated distances between the observations.

Then let’s move on to k-means clustering, which assigns observations to groups (clusters) based on their similarity. The following code runs the k-means algorithm on the dataset:

# k-means clustering
km <-kmeans(Boston, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

It’s difficult to see anything from this, so let’s zoom in a bit and look at columns 6 to 10 only:

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

Next we want to know what the optimal number of clusters is. We can determine this by looking at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. When the number of clusters is plotted against the total WCSS, the optimal number of clusters is where the total WCSS drops sharply.

# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

From the plot above it’s evident that the optimal number of clusters is not three but two. Let’s run the k-means algorithm again with two clusters and then visualize the results:

# k-means clustering
km <-kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)

Super-bonus

Start by running the code below for the (scaled) train data that we used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# load plotly for the 3D plots below (its startup messages are omitted here)
library(plotly)

Then we will use the package plotly to create a 3D plot of the columns of the matrix product:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

This is very cool indeed! But adding different colors for the different crime rate classes definitely makes it even cooler:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

Lastly, we are supposed to define the color by the clusters of the k-means. Here I must be doing something wrong, because the colors disappear completely. A likely reason is that km is an entire kmeans object rather than a vector of cluster labels, and it was fitted on all 506 rows of Boston rather than on the 404 rows of the train set, so it cannot be mapped onto the plotted points; see the sketch after the code below.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km)
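A guess at the likely fix (an untested sketch; running k-means on the train set and the choice of four clusters are my assumptions, not part of the original analysis):

# run k-means on the same observations that were projected (the train set
# without the target variable), then map the cluster labels to the color argument
km_train <- kmeans(dplyr::select(train, -crime), centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = as.factor(km_train$cluster))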

Data wrangling

You can find my preprocessed data set and the data wrangling R script in my GitHub repository.


Chapter 5: Dimensionality reduction techniques

RStudio exercise 5: Data wrangling and analysis

Data wrangling

You can find my preprocessed data set and the data wrangling R script in my GitHub repository.

Analysis

1. Data

In this week’s assignment we are working with data from the Human Development Index (HDI) and the Gender Inequality Index (GII), both compiled by the United Nations Development Programme. You can learn more about the data here on HDI and here on GII.

The dataset we will use was combined and modified in the data wrangling part of this assignment.

We will start by loading the preprocessed “human” data into R and taking a look at its general appearance.

# load the data
human <- read.table(file = "human.csv", sep = ",", header = TRUE)

# glimpse at the dataset
glimpse(human)
## Observations: 155
## Variables: 8
## $ Edu2.FM   <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.969060...
## $ Labo.FM   <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.828611...
## $ Life.Exp  <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0...
## $ Edu.Exp   <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9...
## $ GNI       <int> 166, 135, 156, 139, 140, 137, 127, 154, 134, 117, 18...
## $ Mat.Mor   <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 1...
## $ Ado.Birth <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3...
## $ Parli.F   <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2...

From glimpsing the data we can see that it contains 155 observations on 8 variables, all of which have numerical values. Every row in the data is associated with one country, so we have observations from 155 different countries. From the webpages linked above we can see that most of the variables are observed in the year 2015, but the variable measuring secondary education is measured over the years 2005-2015.
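As a quick check of the one-row-per-country claim (this assumes the country names ended up as row names when the CSV was read, which the absence of a separate country column suggests), we could peek at a few of them:

# hedged check: the first few row names should be country names
head(rownames(human))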

| Variable | Definition |
|----------|------------|
| Edu2.FM | The ratio between the shares of female and male population with at least some secondary education |
| Labo.FM | The ratio between the shares of female and male labour force participation rate |
| Edu.Exp | Expected years of schooling, years |
| Life.Exp | Life expectancy at birth, years |
| GNI | Gross national income (GNI) per capita, 2011 PPP $ |
| Mat.Mor | Maternal mortality ratio, deaths per 100,000 live births |
| Ado.Birth | Adolescent birth rate, births per 1,000 women ages 15-19 |
| Parli.F | Share of seats in parliament, % held by women |

2. Overview of the data

From the summary of the variables we can see the minimum, maximum, median and mean values as well as the 1st and 3rd quartiles of each variable. Looking at the minimum and maximum values, we can see that the countries in our data set differ quite a lot with respect to all eight variables.

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :  2.00   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.: 53.50   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 99.00   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 98.73   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.:143.50   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :194.00   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Now, in order to gain an overview of the distributions of the variables and the relationships between them, we can draw a ggpairs plot. This plot contains a lot of information! We can, for example, immediately see that most of the countries in our sample exhibit very low rates of maternal mortality. Only the distribution of expected years of education appears to be close to a normal distribution; the distributions of the other variables are clearly skewed or have high kurtosis, as in the case of Edu2.FM.

ggpairs(human)

The plot above already lets us study the correlations between the variables, but we can also do this in another way: by computing a correlation matrix and visualizing it with a correlation plot.

# First calculate the correlation matrix and round it to two digits:
cor_matrix <- cor(human) %>% round(digits = 2)

# Print the correlation matrix:
cor_matrix
##           Edu2.FM Labo.FM Life.Exp Edu.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01     0.58    0.59  0.18   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    -0.14    0.05 -0.01    0.24      0.12    0.25
## Life.Exp     0.58   -0.14     1.00    0.79  0.19   -0.86     -0.73    0.17
## Edu.Exp      0.59    0.05     0.79    1.00  0.12   -0.74     -0.70    0.21
## GNI          0.18   -0.01     0.19    0.12  1.00   -0.10     -0.13    0.00
## Mat.Mor     -0.66    0.24    -0.86   -0.74 -0.10    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12    -0.73   -0.70 -0.13    0.76      1.00   -0.07
## Parli.F      0.08    0.25     0.17    0.21  0.00   -0.09     -0.07    1.00
# Visualize the correlation matrix with a correlations plot:
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

From the plot above we can clearly and easily see the correlations between the variables. Note, for example, that expected years of education is highly positively correlated with life expectancy; this is most probably related to the fact that both are positively correlated with a country’s GNI. Maternal mortality and adolescent birth rate are also strongly positively correlated, but here the undertone is of a much more sinister nature. In countries where adolescent girls are more likely to become mothers, i.e. where they lack access to contraceptives or are forced to marry young, expectant mothers are also less likely to have access to good-quality health care. These countries are likely to be low-income countries, which the correlation plot also confirms. All in all, it is clear from the plot that high maternal mortality and a high adolescent birth rate are signals of other problems in a country as well: such countries also have lower education levels and life expectancies, less income per capita and less gender equality in education. Interestingly, the share of women in parliament and the ratio of female to male labor force participation do not appear to correlate much with the other variables.

3. Principal component analysis on non-standardized data

In this section we will perform a principal component analysis (PCA), arguably the most important tool for reducing the number of dimensions in multivariate data, using the singular value decomposition (SVD). Crucially, we will perform the PCA on non-standardized data. Recall from part 2 that the values taken by the variables differ greatly in magnitude; the GNI variable in particular takes very large values compared to the others. We will see why standardizing the data is so important when using PCA.

Recall that in PCA the first principal component captures the maximum amount of variance in the features of the original data, the second PC captures the maximum amount of the variability left over from the first PC, the third captures the maximum amount of the variability left over from the second PC, and so on. All principal components are uncorrelated, and each successive component is less important in terms of captured variance.
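In terms of the SVD (a standard textbook formulation, not something spelled out in the course material): if the column-centered data matrix is decomposed as $X = U D V^\top$, the principal component scores are the columns of $XV$ and the proportion of variance explained by the $k$-th component is

$$\mathrm{PVE}_k = \frac{d_k^2}{\sum_{j=1}^{p} d_j^2},$$

where $d_k$ is the $k$-th singular value. This is exactly the “Proportion of Variance” row that the summary of prcomp() reports below.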

By printing the results of the PCA we can see that the standard deviations (i.e. the variability) of the principal components are very different in magnitude. From the summary of the PCA results and the percentages of variance captured by each PC we can see that almost all of the variability in the original features is captured by the first principal component. A good hunch is that we will run into trouble when plotting the biplot.

# perform principal component analysis (with the SVD method) and print out the results
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 214.3202186  54.4858892  26.3814123  11.4791149   4.0668732   1.6067062
## [7]   0.1905111   0.1586732
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4
## Edu2.FM   -0.0007468424  4.707363e-04 -0.0001689810 -0.0004252438
## Labo.FM    0.0002210928  5.783875e-05 -0.0007563442 -0.0047679411
## Life.Exp  -0.0334485095  1.573401e-02 -0.0303860782 -0.0780112042
## Edu.Exp   -0.0098109042  2.441895e-03 -0.0223269972 -0.0369246053
## GNI       -0.0278166075  9.979413e-01  0.0557767706 -0.0002740466
## Mat.Mor    0.9879825869  3.620190e-02 -0.1474236772 -0.0068326906
## Ado.Birth  0.1479114018 -5.046353e-02  0.9867754042 -0.0069374879
## Parli.F   -0.0048164618 -1.494485e-03 -0.0026652825 -0.9962093264
##                     PC5           PC6           PC7           PC8
## Edu2.FM    0.0002964017 -0.0254740423  6.859097e-01  7.272399e-01
## Labo.FM   -0.0034895150 -0.0320024527  7.265303e-01 -6.863629e-01
## Life.Exp  -0.9759219723  0.1979058277  5.461806e-03  2.081465e-03
## Edu.Exp   -0.1947276781 -0.9790080609 -4.054956e-02  3.992920e-03
## GNI        0.0150268685 -0.0014692010 -3.631899e-04 -3.767644e-04
## Mat.Mor   -0.0282554922 -0.0005722709  1.213189e-04  8.299757e-04
## Ado.Birth -0.0393039689 -0.0160315111 -4.359137e-05 -9.463203e-05
## Parli.F    0.0841200882  0.0210694313 -2.695182e-03  2.658636e-03
# create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components%s:
##                             PC1      PC2      PC3      PC4     PC5     PC6
## Standard deviation     214.3202 54.48589 26.38141 11.47911 4.06687 1.60671
## Proportion of Variance   0.9233  0.05967  0.01399  0.00265 0.00033 0.00005
## Cumulative Proportion    0.9233  0.98298  0.99697  0.99961 0.99995 1.00000
##                           PC7    PC8
## Standard deviation     0.1905 0.1587
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000
# calculate rounded percentages of variance captured by each PC and print out the percentages of variance
pca_pr <- round(100*s$importance[2,], digits = 1) 
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 92.3  6.0  1.4  0.3  0.0  0.0  0.0  0.0

This is indeed the case. Recall that a biplot shows a scatter plot of the observations on the first two PCs together with arrows for the original features, describing their relationships both with each other and with the principal components. As essentially all variability is captured by PC1 and almost none by PC2, the relationships between the original features and the principal components do not really show. Indeed, since the lengths of the arrows in the biplot are proportional to the standard deviations of the original features, the only visible arrow is the one representing GNI, which takes values many times larger than the other variables. All in all, there isn’t much we can learn from this plot.

# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "Biplot of the two principal components of the unscaled human data")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

4. Principal component analysis on standardized data

Next we will perform a similar analysis on the standardized data.

Start by standardizing the data, which is easily done with the scale() function. Recall that standardizing a variable means subtracting the mean of the variable from each observation and dividing the result by the standard deviation of the variable. From the summary of the standardized variables we can see that the values of the different variables are now on much more comparable scales than before.
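Written out as a formula (the standard definition of the z-score, stated here for clarity), the scaled value of an observation $x_i$ of a variable $x$ is

$$z_i = \frac{x_i - \bar{x}}{s_x},$$

where $\bar{x}$ is the sample mean and $s_x$ the sample standard deviation of the variable.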

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI               Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-1.767729   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.826563   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median : 0.004952   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.818192   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 1.741082   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

Now we can perform the principal component analysis again and get some sensible results. This time the standard deviations of the principal components are of comparable magnitude, and the variability captured by the PCs is much more evenly distributed. Still, it is clear that PC1 is the most important one, as it captures 48.3 % of the variance in the original features.

# perform principal component analysis (with the SVD method) and print out the results
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 1.9658091 1.1387842 0.9899881 0.8659777 0.6993077 0.5400078 0.4670082
## [8] 0.3317153
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4         PC5
## Edu2.FM   -0.38438136  0.04973416 -0.08668975  0.31301506 -0.82578999
## Labo.FM    0.06452258  0.72023430 -0.10293459  0.59483194  0.20499428
## Life.Exp  -0.46758272 -0.01554845  0.02106889 -0.09983270  0.19702138
## Edu.Exp   -0.44611813  0.14839516  0.06836143  0.09857478  0.22718207
## GNI       -0.11121526 -0.01842917 -0.97298099 -0.16503899  0.07296310
## Mat.Mor    0.47036562  0.12782929 -0.11999280  0.04185463  0.04609075
## Ado.Birth  0.43435393  0.06675352 -0.06553107 -0.05173116 -0.38364001
## Parli.F   -0.09031518  0.65984104  0.10671258 -0.70487388 -0.17604400
##                    PC6          PC7         PC8
## Edu2.FM    0.139911426  0.100381612  0.18084014
## Labo.FM    0.008135958 -0.256990684 -0.06742229
## Life.Exp  -0.393884789 -0.421527811  0.63171658
## Edu.Exp   -0.415647904  0.716119209 -0.16542557
## GNI        0.005876052  0.009858605 -0.08891751
## Mat.Mor    0.085633294  0.472803692  0.71642517
## Ado.Birth -0.792816131 -0.096488804 -0.12191407
## Parli.F    0.128550324  0.020403204 -0.01688909
# create and print out a summary of pca_human_std
s_std <- summary(pca_human_std)
s_std
## Importance of components%s:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.966 1.1388 0.9900 0.86598 0.69931 0.54001 0.46701
## Proportion of Variance 0.483 0.1621 0.1225 0.09374 0.06113 0.03645 0.02726
## Cumulative Proportion  0.483 0.6452 0.7677 0.86140 0.92253 0.95898 0.98625
##                            PC8
## Standard deviation     0.33172
## Proportion of Variance 0.01375
## Cumulative Proportion  1.00000
# calculate rounded percentages of variance captured by each PC and print out the percentages of variance
pca_pr_std <- round(100*s_std$importance[2,], digits = 1) 
pca_pr_std
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 48.3 16.2 12.3  9.4  6.1  3.6  2.7  1.4

Let’s see how our results now look in the biplot.

# create object pc_lab_std to be used as axis labels
pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")

# draw a biplot
biplot(pca_human_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_std[1], ylab = pc_lab_std[2], main = "Biplot of the two principal components of the scaled human data")

From the two biplots above we can clearly see that standardizing the variables has a big effect on the results. The first biplot with unscaled data gave us very little information on the original features. Recall that PCA is sensitive to the relative scaling of the original features and assumes that features with larger variance are more important than features with smaller variance. This is why standardizing the data is so important.

Our data set has observations on 8 different variables, that is, it has 8 dimensions. Just by looking at all the panels of, say, the ggpairs plot above, it can be very hard to tease out interesting details in the data; not much stands out when you have a heap of scatter plots! But with PCA and a biplot we can, for example, note that there is something different about Rwanda, which is a lonely outlier in our plot. We can also immediately see that three of our eight variables (Parli.F, Labo.FM and GNI) are not correlated with the other variables or with the first PC, but instead align with the second PC (the angle between their arrows and the PC2 axis is small). The five other variables are correlated with the first PC.

5. Interpreting the two principal components

The second biplot clearly confirms what we saw in the correlation plot: the share of female parliamentary representation, the difference between female and male labor force participation and GNI per capita are not really correlated with the other five variables. Also note that the arrow for GNI is much shorter than the arrows of the others (recall that the length of an arrow is proportional to the standard deviation of the feature).

We can also see that these three variables are quite clearly correlated with the second PC, whereas the other five are correlated with the first PC. Also note that the arrows of those five variables are in general longer than the arrows of the three correlated with PC2, which reflects the difference in their relative importance.

6. Multiple Correspondence Analysis

Load the tea dataset from the package FactoMineR. Explore the data briefly: look at the structure and the dimensions of the data and visualize it. Then do Multiple Correspondence Analysis on the tea data (or on certain columns of it, it’s up to you). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.

Start by installing the FactoMineR package and accessing the library (code not shown here). Then load the “tea” dataset and take a glimpse at the data. We have 300 observations on 36 variables. Note that almost all of the variables are factors, i.e. categorical variables, which is essential for MCA; the only exception is age, which is recorded as an integer.
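For completeness, the hidden setup chunk would look roughly like this (a sketch of the calls that were run with the output suppressed):

# install (if needed) and load FactoMineR, which provides MCA() and the tea data
# install.packages("FactoMineR")
library(FactoMineR)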

# load the data
data("tea")

# glimpse at the dataset
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast        <fctr> breakfast, breakfast, Not.breakfast, Not.bre...
## $ tea.time         <fctr> Not.tea time, Not.tea time, tea time, Not.te...
## $ evening          <fctr> Not.evening, Not.evening, evening, Not.eveni...
## $ lunch            <fctr> Not.lunch, Not.lunch, Not.lunch, Not.lunch, ...
## $ dinner           <fctr> Not.dinner, Not.dinner, dinner, dinner, Not....
## $ always           <fctr> Not.always, Not.always, Not.always, Not.alwa...
## $ home             <fctr> home, home, home, home, home, home, home, ho...
## $ work             <fctr> Not.work, Not.work, work, Not.work, Not.work...
## $ tearoom          <fctr> Not.tearoom, Not.tearoom, Not.tearoom, Not.t...
## $ friends          <fctr> Not.friends, Not.friends, friends, Not.frien...
## $ resto            <fctr> Not.resto, Not.resto, resto, Not.resto, Not....
## $ pub              <fctr> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,...
## $ Tea              <fctr> black, black, Earl Grey, Earl Grey, Earl Gre...
## $ How              <fctr> alone, milk, alone, alone, alone, alone, alo...
## $ sugar            <fctr> sugar, No.sugar, No.sugar, sugar, No.sugar, ...
## $ how              <fctr> tea bag, tea bag, tea bag, tea bag, tea bag,...
## $ where            <fctr> chain store, chain store, chain store, chain...
## $ price            <fctr> p_unknown, p_variable, p_variable, p_variabl...
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex              <fctr> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M,...
## $ SPC              <fctr> middle, middle, other worker, student, emplo...
## $ Sport            <fctr> sportsman, sportsman, sportsman, Not.sportsm...
## $ age_Q            <fctr> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35...
## $ frequency        <fctr> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, ...
## $ escape.exoticism <fctr> Not.escape-exoticism, escape-exoticism, Not....
## $ spirituality     <fctr> Not.spirituality, Not.spirituality, Not.spir...
## $ healthy          <fctr> healthy, healthy, healthy, healthy, Not.heal...
## $ diuretic         <fctr> Not.diuretic, diuretic, diuretic, Not.diuret...
## $ friendliness     <fctr> Not.friendliness, Not.friendliness, friendli...
## $ iron.absorption  <fctr> Not.iron absorption, Not.iron absorption, No...
## $ feminine         <fctr> Not.feminine, Not.feminine, Not.feminine, No...
## $ sophisticated    <fctr> Not.sophisticated, Not.sophisticated, Not.so...
## $ slimming         <fctr> No.slimming, No.slimming, No.slimming, No.sl...
## $ exciting         <fctr> No.exciting, exciting, No.exciting, No.excit...
## $ relaxing         <fctr> No.relaxing, No.relaxing, relaxing, relaxing...
## $ effect.on.health <fctr> No.effect on health, No.effect on health, No...

We will follow the DataCamp exercises and restrict our analysis to six columns of the dataset. We will thus have 300 observations on 6 variables.

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
glimpse(tea_time)
## Observations: 300
## Variables: 6
## $ Tea   <fctr> black, black, Earl Grey, Earl Grey, Earl Grey, Earl Gre...
## $ How   <fctr> alone, milk, alone, alone, alone, alone, alone, milk, m...
## $ how   <fctr> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag, t...
## $ sugar <fctr> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar, N...
## $ where <fctr> chain store, chain store, chain store, chain store, cha...
## $ lunch <fctr> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch, ...
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 

We can visualize the data using bar plots. From the plots we can for example note that most tea drinkers surveyed here completely miss the point of drinking tea: Earl Grey from a tea bag bought from a chain store is basically an insult to tea.

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
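As a side note (my own addition): gather() comes from tidyr and has since been superseded. With a newer tidyr (>= 1.0) and dplyr (>= 1.0) the same reshaping could be written roughly like this, converting the factor columns to character first so they can be stacked:

# hedged alternative to gather() using pivot_longer()
library(dplyr)
library(tidyr)
library(ggplot2)
tea_long <- tea_time %>%
  mutate(across(everything(), as.character)) %>%  # factors -> character so the columns can be combined
  pivot_longer(cols = everything(), names_to = "key", values_to = "value")
ggplot(tea_long, aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))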

Next we want to do a Multiple Correspondence Analysis (MCA) on the tea data. Recall that MCA is a method for analyzing qualitative data; it can be used to detect patterns or structure in the data as well as for dimension reduction.

MCA builds on a matrix of two-way cross-tabulations between all the variables in the dataset and presents the information of these cross-tabulations in a clear graphical form. In the MCA factor maps the variable categories are drawn on the first two dimensions discovered by the MCA. When interpreting the graphical results, recall that the distance between variable categories gives a measure of their similarity.

The default plots group the observations in different ways and give some indication of their relationships. In the second plot all the individual observations are plotted; in the first plot they are categorized according to the value of the observation; and in the third plot the picture is simplified even further by combining all the categories of a single variable into one data point. From the default plots we can, for example, see that the variables “where” (where the tea is bought) and “how” (whether the tea is bought in bags or unpackaged) are closely related, which is not surprising. These two variables are, however, distinct from the other four variables.

# multiple correspondence analysis
mca <- MCA(tea_time)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

Next we can make the first of the above plots a bit more informative by adding the argument habillage = “quali”. This colors each category according to the variable it belongs to, so we can now easily see, for example, the relationship between where the tea is bought and in what form it is bought.

# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
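Since the exercise also invites exploring other plotting options, one possibility (an assumption on my part: it requires the separate factoextra package, which is not used elsewhere in this report) is the ggplot2-based MCA plots from factoextra:

# hedged sketch: alternative MCA plots with the factoextra package (if installed)
# install.packages("factoextra")
library(factoextra)
fviz_mca_var(mca, repel = TRUE)     # variable categories on dimensions 1 and 2
fviz_mca_biplot(mca, repel = TRUE)  # individuals and categories together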