Question 1: Write an R script to do the following:
a) simulate a sample of 100 random data points from a normal distribution with mean 100 and
standard deviation 5 and store the result in a vector.
b) visualize the vector created above using different plots.
c) test the hypothesis that the mean equals 100.
d) use the Wilcoxon test to test the hypothesis that the mean equals 90.
Sure! Here’s an R script that accomplishes the tasks you mentioned:
    # Set the seed for reproducibility
    set.seed(123)

    # a) Simulate a sample of 100 random data points from a normal distribution
    data <- rnorm(100, mean = 100, sd = 5)

    # b) Visualize the vector using different plots
    # Histogram
    hist(data, main = "Histogram of Data", xlab = "Value", ylab = "Frequency")

    # Boxplot
    boxplot(data, main = "Boxplot of Data", ylab = "Value")

    # c) Test the hypothesis that the mean equals 100
    t.test(data, mu = 100)

    # d) Use Wilcoxon test to test the hypothesis that the mean equals 90
    wilcox.test(data, mu = 90)
In this script, we first fix the random seed with set.seed so that the results are reproducible, then use the rnorm function to generate a sample of 100 random data points from a normal distribution with a mean of 100 and a standard deviation of 5. We store the result in the data vector.
Then, we visualize the data vector using different plots. The script includes a histogram, which shows the overall shape of the distribution, and a boxplot, which shows the median, the quartiles, and any outliers.
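A histogram and a boxplot are only two of several options. As a minimal sketch (the variable name x, the 2-by-2 layout, and the choice of a density plot and a normal Q-Q plot are assumptions added here, not part of the original script), four views of the same sample can be drawn side by side:

```r
# Recreate the sample so this block runs on its own
set.seed(123)
x <- rnorm(100, mean = 100, sd = 5)

# Arrange four plots in a 2 x 2 grid and remember the old settings
op <- par(mfrow = c(2, 2))

hist(x, main = "Histogram", xlab = "Value")
boxplot(x, main = "Boxplot", ylab = "Value")
plot(density(x), main = "Kernel density estimate", xlab = "Value")
qqnorm(x, main = "Normal Q-Q plot")
qqline(x)  # reference line through the first and third quartiles

# Restore the previous graphics settings
par(op)
```

Resetting par at the end keeps later plots from being squeezed into the same grid.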
Next, we test the hypothesis that the mean of the data equals 100 using the t.test function. This function performs a one-sample t-test, comparing the mean of the data to the specified null hypothesis value of 100.
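To make the link between the function call and the underlying formula concrete, the sketch below recomputes the t statistic by hand as (sample mean - 100) / (s / sqrt(n)) and compares it with the value stored in the object returned by t.test. The names x, res, and t_manual are illustrative and do not appear in the original script.

```r
set.seed(123)
x <- rnorm(100, mean = 100, sd = 5)

res <- t.test(x, mu = 100)

# One-sample t statistic computed from its definition
n <- length(x)
t_manual <- (mean(x) - 100) / (sd(x) / sqrt(n))

# The returned object is a list; useful components include:
res$statistic   # t statistic
res$parameter   # degrees of freedom (n - 1)
res$p.value     # two-sided p-value
res$conf.int    # 95% confidence interval for the mean

# The manual value should agree with the one reported by t.test
all.equal(unname(res$statistic), t_manual)
```

Since the data were simulated with a true mean of 100, one would generally expect a p-value that gives no evidence against the null hypothesis, although any single random sample can behave differently.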
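Finally, we use the wilcox.test function to perform a one-sample Wilcoxon signed-rank test. Strictly speaking, this test concerns the location (pseudo-median) of the data rather than the mean: the null hypothesis is that the distribution is symmetric about 90. For a symmetric distribution such as the normal, the median and the mean coincide, so here the result can also be read as a test of the mean.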
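Because the sample was simulated around 100, a hypothesized location of 90 should be firmly rejected. A minimal sketch, assuming the same simulated sample; the conf.int = TRUE argument is an addition that asks wilcox.test to also report an estimate of the pseudo-median together with a confidence interval:

```r
set.seed(123)
x <- rnorm(100, mean = 100, sd = 5)

w <- wilcox.test(x, mu = 90, conf.int = TRUE)

w$p.value    # very small: the data are not centred on 90
w$estimate   # estimated (pseudo)median of the sample
w$conf.int   # confidence interval for that location

# For comparison, the sample median and mean
median(x)
mean(x)
```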
Feel free to run this script in an R environment to see the results and explore the visualizations.
Question 2: Use the algae data set from the DMwR package to complete the following tasks.
a) create a graph that you find adequate to show the distribution of the values of algae a6.
b) show the distribution of the values of size 3.
c) check visually if oPO4 follows a normal distribution.
d) produce a graph that allows you to understand how the values of NO3 are distributed
across the sizes of river.
e) using a graph check if the distribution of algae a1 varies with the speed of the river.
f) visualize the relationship between the frequencies of algae a1 and a6. Give the
appropriate graph title, x-axis and y-axis title.
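To complete the tasks, you need to install and load the DMwR package first. Note that the data set is named algae (lowercase) in the package. If DMwR cannot be installed directly from CRAN on your system, its successor DMwR2 also appears to provide the algae data. Here’s an R script that accomplishes the tasks: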
    # Install (once) and load the required package
    install.packages("DMwR")
    library(DMwR)

    # Load the data set (its name in the package is lowercase: algae)
    data(algae)

    # a) Create a graph to show the distribution of the values of algae a6
    hist(algae$a6, main = "Distribution of Algae a6", xlab = "Algae a6 Value", ylab = "Frequency")

    # b) Show the distribution of a6 for the third level of size
    # (size is a factor, so comparing it with the number 3 selects no rows)
    size3 <- levels(algae$size)[3]
    hist(algae$a6[algae$size == size3], main = paste("Distribution of Algae a6 for Size:", size3), xlab = "Algae a6 Value", ylab = "Frequency")

    # c) Check visually if oPO4 follows a normal distribution
    hist(algae$oPO4, main = "Distribution of oPO4", xlab = "oPO4 Value", ylab = "Frequency")

    # d) Produce a graph to understand how the values of NO3 are distributed across the sizes of the river
    # (the grouping column is called size)
    boxplot(NO3 ~ size, data = algae, main = "Distribution of NO3 across River Sizes", xlab = "River Size", ylab = "NO3 Value")

    # e) Check if the distribution of algae a1 varies with the speed of the river
    # (the grouping column is called speed)
    boxplot(a1 ~ speed, data = algae, main = "Distribution of Algae a1 across River Speeds", xlab = "River Speed", ylab = "Algae a1 Value")

    # f) Visualize the relationship between the frequencies of algae a1 and a6
    plot(algae$a1, algae$a6, main = "Relationship between Algae a1 and a6", xlab = "Algae a1", ylab = "Algae a6")
In this script, we install the DMwR package, load it into the R session, and then load the algae data set using the data function.
For task a), we create a histogram to show the distribution of the values of algae a6 using the hist function.
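For task b), we create a histogram of the algae a6 values for a single river size. Because size is a factor (its levels are labels such as large, medium, and small) rather than a number, the rows have to be selected by level (for example, the third level) rather than by comparing size with the number 3, which matches no rows. This reading of "size 3" is one interpretation of an ambiguous question.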
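For task c), we create a histogram to visually check whether the variable oPO4 follows a normal distribution. A roughly symmetric, bell-shaped histogram would be consistent with normality, while a long tail to one side would argue against it.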
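A normal Q-Q plot is often an easier visual check than a histogram alone, because departures from normality show up as systematic bends away from a straight line. The sketch below uses a simulated right-skewed vector as a stand-in so that it runs without the package; the name opo4_like and the exponential distribution are assumptions for illustration only, and with the real data one would pass algae$oPO4 instead.

```r
# Hypothetical stand-in for algae$oPO4: a right-skewed, non-negative variable
set.seed(42)
opo4_like <- rexp(200, rate = 1 / 70)

op <- par(mfrow = c(1, 2))

# Histogram on the density scale with a fitted normal curve overlaid
hist(opo4_like, freq = FALSE,
     main = "Histogram with normal curve", xlab = "Value")
curve(dnorm(x, mean = mean(opo4_like), sd = sd(opo4_like)),
      add = TRUE, lty = 2)

# Normal Q-Q plot: points far from the line indicate non-normality
qqnorm(opo4_like, main = "Normal Q-Q plot")
qqline(opo4_like)

par(op)
```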
For task d), we use a boxplot of NO3 grouped by the size column to understand how the values of NO3 are distributed across the different sizes of river.
For task e), we use a boxplot of a1 grouped by the speed column to check if the distribution of algae a1 varies with the speed of the river.
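Both grouped boxplots rely on R's formula interface, where y ~ g means "y split by the groups in g". A small sketch on made-up data (the data frame river and its values are hypothetical; only the column names NO3 and size mirror the algae data) shows the pattern together with a numeric summary per group:

```r
# Hypothetical data frame with a numeric column and a grouping factor
set.seed(7)
river <- data.frame(
  size = factor(rep(c("large", "medium", "small"), each = 20)),
  NO3  = c(rlnorm(20, 0.5), rlnorm(20, 1.0), rlnorm(20, 1.5))
)

# Formula interface: NO3 on the y-axis, one box per level of size
boxplot(NO3 ~ size, data = river,
        main = "NO3 by river size (toy data)",
        xlab = "River size", ylab = "NO3")

# Per-group medians, matching the thick line inside each box
group_medians <- tapply(river$NO3, river$size, median)
group_medians
```

Passing data = river keeps the formula short and avoids repeating the data frame name on both sides of the tilde.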
For task f), we create a scatter plot to visualize the relationship between the frequencies of algae a1 and a6. The graph title, x-axis, and y-axis titles are provided.
You can run this script in an R environment with the DMwR package installed to see the graphs and explore the distributions of the variables in the algae data set.
Question 3: Read the file Coweeta.CSV and write an R script to do the following:
a) count the number of observations per species.
b) take a subset of the data including only those species with at least 10 observations.
c) make a scatter plot of biomass versus height, with the symbol colour varying by species,
and use filled squares for the symbols. Also add a title to the plot, in italics.
d) log-transform biomass, and redraw the plot.
To accomplish the tasks using the Coweeta.CSV file in R, you can follow this R script:
    # Read the CSV file
    data <- read.csv("Coweeta.CSV")

    # a) Count the number of observations per species
    species_counts <- table(data$species)
    species_counts

    # b) Subset the data to include only species with at least 10 observations
    filtered_data <- subset(data, species %in% names(species_counts[species_counts >= 10]))

    # Convert species to a factor so that it can be used to pick plot colors
    # (this also drops the species that were filtered out)
    filtered_data$species <- factor(filtered_data$species)

    # c) Make a scatter plot of biomass versus height, with symbol color varying by species and using filled squares
    # (the italic title needs a plotmath expression)
    plot(filtered_data$height, filtered_data$biomass, pch = 15, col = filtered_data$species, main = expression(italic("Scatter Plot of Biomass versus Height")), xlab = "Height", ylab = "Biomass")

    # d) Log-transform biomass and redraw the plot
    filtered_data$log_biomass <- log(filtered_data$biomass)
    plot(filtered_data$height, filtered_data$log_biomass, pch = 15, col = filtered_data$species, main = expression(italic("Scatter Plot of Log-Transformed Biomass versus Height")), xlab = "Height", ylab = "Log-Transformed Biomass")
In this script, we first read the Coweeta.CSV file using the read.csv function and store the data in the data variable.
For task a), we use the table function to count the number of observations per species. The result is stored in the species_counts variable.
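As a small illustration of what table returns, the sketch below counts a made-up species vector (the codes are hypothetical and not taken from the Coweeta file) and sorts the counts in decreasing order:

```r
# Hypothetical species codes
species <- c("ACRU", "ACRU", "BELE", "QUPR", "ACRU", "BELE")

counts <- table(species)
counts

# Most frequent species first
sort(counts, decreasing = TRUE)

# The same counts as a data frame, which is convenient for further processing
as.data.frame(counts, stringsAsFactors = FALSE)
```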
For task b), we subset the data to include only those species with at least 10 observations using the subset function and the %in% operator. The resulting filtered data is stored in the filtered_data variable.
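The filtering step can be checked on a toy data frame where the answer is known in advance. In this sketch the data frame toy and the species labels A, B, and C are hypothetical; species B has only 3 rows and should be dropped:

```r
toy <- data.frame(
  species = rep(c("A", "B", "C"), times = c(12, 3, 10)),
  height  = seq(1, 25)
)

counts <- table(toy$species)

# Names of the species that reach the threshold of 10 observations
keep <- names(counts[counts >= 10])

# Keep only rows whose species is in that set
filtered <- subset(toy, species %in% keep)

table(filtered$species)
nrow(filtered)
```

Species A and C together contribute 12 + 10 = 22 rows, so that is the number of rows expected after filtering.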
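For task c), we create a scatter plot of biomass versus height using the plot function. We use filled squares (pch = 15) as symbols and vary the symbol color by species (col = filtered_data$species); for this to work, species must be a factor, since a plain character column would be interpreted as color names. The plot is given a title using the main argument, and the title is rendered in italics by wrapping the text in expression(italic(...)), because italic is only understood inside a plotmath expression. The x-axis label is set as “Height”, and the y-axis label is set as “Biomass”.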
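Two details in this step are easy to get wrong: the color argument needs a factor (or integer codes) rather than plain character strings, and the italic title needs a plotmath expression. A minimal sketch on simulated data, where the data frame toy, its values, and the added legend are assumptions for illustration:

```r
set.seed(1)
toy <- data.frame(
  species = rep(c("sp1", "sp2", "sp3"), each = 10),
  height  = runif(30, min = 5, max = 30)
)
toy$biomass <- 0.05 * toy$height^2.5 * exp(rnorm(30, sd = 0.3))

# A factor is stored as integer codes, which index the colour palette
toy$species <- factor(toy$species)

plot(toy$height, toy$biomass,
     pch = 15,                      # filled square
     col = toy$species,             # one colour per species
     main = expression(italic("Biomass versus height")),
     xlab = "Height", ylab = "Biomass")

# Legend colours use the same integer codes as the points
legend("topleft", legend = levels(toy$species),
       col = seq_along(levels(toy$species)), pch = 15)
```

An alternative is to keep main as a plain string and set font.main = 3, which selects an italic font for the title.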
For task d), we log-transform the biomass values by applying the log function (the natural logarithm) to the biomass variable in the filtered_data dataset. We then redraw the scatter plot using the log-transformed biomass values, following the same steps as in task c).
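An alternative to creating a transformed column is to keep the raw values and ask plot for a logarithmic y-axis with log = "y". The points land in the same relative positions, but the axis is labelled in the original units. A sketch on simulated data (the data frame toy and its values are hypothetical):

```r
set.seed(1)
toy <- data.frame(height = runif(30, min = 5, max = 30))
toy$biomass <- 0.05 * toy$height^2.5 * exp(rnorm(30, sd = 0.3))

# log() is only defined for positive values, so check before transforming
stopifnot(all(toy$biomass > 0))
toy$log_biomass <- log(toy$biomass)   # natural logarithm

op <- par(mfrow = c(1, 2))

# Option 1: plot the transformed column
plot(toy$height, toy$log_biomass, pch = 15,
     main = "Transformed values", xlab = "Height", ylab = "log(Biomass)")

# Option 2: plot raw values on a logarithmic axis
plot(toy$height, toy$biomass, pch = 15, log = "y",
     main = "Logarithmic y-axis", xlab = "Height", ylab = "Biomass")

par(op)
```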
Make sure to update the file path in the read.csv function to the correct location of the Coweeta.CSV file on your system. You can run this script in an R environment to perform the specified tasks and visualize the scatter plots.